Abstract

Context-Free Grammars (CFGs) and Parsing Expression Grammars (PEGs)
have several similarities and a few differences in both their syntax and semantics,
but they are usually presented through formalisms that hinder a proper
comparison. In this paper we present a new formalism for CFGs that highlights
the similarities and differences between them. The new formalism borrows from
PEGs the use of parsing expressions and the recognition-based semantics. We
show how one way of removing non-determinism from this formalism yields a
formalism with the semantics of PEGs. We also prove, based on these new
formalisms, how LL(1) grammars define the same language whether interpreted as
CFGs or as PEGs, and also show how strong-LL(k), right-linear, and LL-regular
grammars have simple language-preserving translations from CFGs to PEGs.
1. Introduction

Context-Free Grammars (CFGs) are the formalism of choice for describing 
the syntax of programming languages. A CFG describes a language as the set 
of strings generated from the grammar's initial symbol by a sequence of rewriting
steps. CFGs do not, however, specify a method for efficiently recognizing
whether an arbitrary string belongs to the language and, when it does, what
its underlying syntactic structure (the string's parse tree) is; in other words,
a CFG does not specify how to parse the language, an essential operation for
working with the language (in a compiler, for example). Another problem with 
CFGs is ambiguity, where a string can have more than one parse tree. 

Parsing Expression Grammars (PEGs) [1] are an alternative formalism for 
describing a language's syntax. Unlike CFGs, PEGs are unambiguous by con- 
struction, and their standard semantics is based on recognizing strings instead of 
generating them. A PEG can be considered both the specification of a language 
and the specification of a top-down parser for that language. 

The idea of using a formalism for specifying parsers is not new; PEGs are 
based on two formalisms first proposed in the early seventies, Top-Down Parsing 
Language (TDPL) [2] and Generalized TDPL (GTDPL) [3]. PEGs have in
common with TDPL and GTDPL the notion of limited backtracking top-down 
parsing: the parser, when faced with several alternatives, will try them in a 
deterministic order (left to right), discarding remaining alternatives after one of 
them succeeds. Compared with the older formalisms, PEGs introduce a more 
expressive syntax, based on the syntax of regexes, and syntactic predicates, a
form of unrestricted lookahead where the parser checks whether the rest of the 
input matches a parsing expression without consuming the input. 

In this paper, we argue that the similarities between CFGs and PEGs are
deeper than usually thought, and that these similarities have been obscured by
the way the two formalisms have been presented. PEGs, instead of being a formalism
completely unrelated to CFGs, can be seen as a natural outcome of removing
the ambiguity of CFGs.

We start with a new semantics for CFGs, using the framework of natural
semantics [4, 5]. The new semantics borrows the syntax of PEGs and is also based
on recognizing strings. We make the source of the ambiguity of CFGs, their non- 
deterministic alternatives for each non-terminal, explicit in the semantics of our 
new non-deterministic choice operator. We then remove the non-determinism, 
and consequently the ambiguity, by adding explicit failure and ordered choice to 
the semantics, so now we can only use the second alternative in a choice if the 
first one fails. By that point, we only need to add the "not" syntactic predicate to
arrive at an alternative semantics for PEGs (modulo syntactic sugar such as the
repetition operator and the "and" syntactic predicate). We prove that our new
semantics for both CFGs and PEGs are equivalent to the usual ones. 

We also show in this paper how our new semantics for CFGs gives us a way 
to translate some subsets of CFGs to PEGs that parse the same language. The 
idea is that, as the sole distinction between CFG and PEG semantics is in the 
choice operator, we will have a PEG that is equivalent to the CFG whenever we 
can make the PEG choose the correct alternative at each choice, either through 
reordering or with the help of syntactic predicates. We show transformations 
from CFGs to PEGs for three unambiguous subsets of CFGs: LL(1),
strong-LL(k), and LL-regular.

A straightforward correspondence between LL(1) grammars and PEGs was 
already noted [6], but never formally proven. The correspondence is that an
LL(1) grammar describes the same language whether interpreted as a CFG 
or as a PEG. The intuition is that, if an LL(1) parser is able to choose an
alternative with a single symbol of lookahead, then a PEG parser will fail for
every alternative that is not the correct one. We prove that this intuition is 
correct if none of the alternatives in the CFG can generate the empty string, 
and also prove that a simple ordering of the alternatives suffices to preserve the
correspondence even if there are alternatives that can generate the empty string. 

There is no such correspondence between strong-LL(k) grammars and PEGs,
not even by imposing a specific order among the alternatives for each
non-terminal. Nevertheless, we also prove that we can transform a strong-LL(k)
grammar to a PEG, just by adding a predicate to each alternative of a
non-terminal.

Our transformations lead to efficient parsers for LL(1) and strong-LL(k)
grammars even in PEG implementations that do not use memoization to guarantee
O(n) performance. The resulting PEGs only use backtracking to test the
lookahead of each alternative, so their use of backtracking is equivalent to a
top-down parser checking the next k symbols of lookahead against the lookahead
values of each production.

For our transformation of LL-regular grammars, we first prove that right- 
linear grammars for languages with the prefix property have the same language 
whether interpreted as CFGs or as PEGs, then use this result to build lookahead 
expressions for the alternatives of each non-terminal based on the regular
partition into which each alternative falls.

The rest of this paper is organized as follows: Section 2 presents our new 
semantics for CFGs and PEGs, showing how to arrive at the latter from the 
former, and proves their correctness. Section 3 shows how an LL(1) grammar 
describes the same language when interpreted as a PEG, and proves this corre- 
spondence. Section 4 shows how a simple transformation generates a PEG from 
any strong-LL(k) grammar, keeping the same general structure and describing
the same language as the original grammar, and proves the latter assertion. 
Section 5 shows the equivalence between some right-linear CFGs and PEGs, 
and how this can be used to build a simple transformation that generates a 
PEG from an LL-regular grammar. Finally, Section 6 reviews related work, and 
Section 7 summarizes the paper's contributions and gives our final remarks. 

2. From CFGs to PEGs

The traditional definition of a CFG is as a tuple $(V, T, P, S)$ of a finite set
V of non-terminal symbols, a finite set T of terminal symbols, a finite relation
P between non-terminals and strings of terminals and non-terminals, and an
initial non-terminal S. We say that $A \rightarrow \beta$ is a production of G if and only if
$(A, \beta) \in P$.

A grammar G defines a relation $\Rightarrow_G$ where $\alpha A \gamma \Rightarrow_G \alpha \beta \gamma$ if and only if
$A \rightarrow \beta$ is a production of G. The language of G is the set of all strings of
terminal symbols that relate to S by the reflexive-transitive closure of $\Rightarrow_G$. We
can interpret the relation $\Rightarrow_G$ as a rewriting step, and then the language of G
is the set of all strings of terminals that can be generated from S by a finite
number of rewriting steps.



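$$\text{Empty}\quad \frac{}{G[\varepsilon]\ x \stackrel{\text{CFG}}{\leadsto} x}\ \text{(empty.1)} \qquad \text{Terminal}\quad \frac{}{G[a]\ ax \stackrel{\text{CFG}}{\leadsto} x}\ \text{(char.1)}$$

$$\text{Non-terminal}\quad \frac{G[P(A)]\ xy \stackrel{\text{CFG}}{\leadsto} y}{G[A]\ xy \stackrel{\text{CFG}}{\leadsto} y}\ \text{(var.1)}$$

$$\text{Concatenation}\quad \frac{G[p_1]\ xyz \stackrel{\text{CFG}}{\leadsto} yz \qquad G[p_2]\ yz \stackrel{\text{CFG}}{\leadsto} z}{G[p_1 p_2]\ xyz \stackrel{\text{CFG}}{\leadsto} z}\ \text{(con.1)}$$

$$\text{Choice}\quad \frac{G[p_1]\ xy \stackrel{\text{CFG}}{\leadsto} y}{G[p_1 \mid p_2]\ xy \stackrel{\text{CFG}}{\leadsto} y}\ \text{(choice.1)} \qquad \frac{G[p_2]\ xy \stackrel{\text{CFG}}{\leadsto} y}{G[p_1 \mid p_2]\ xy \stackrel{\text{CFG}}{\leadsto} y}\ \text{(choice.2)}$$

Figure 1: Natural semantics of $\stackrel{\text{CFG}}{\leadsto}$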



We want to give a new definition for CFGs that is closer to PEGs, so the 
similarities between the two formalisms will be more visible. Our new defini- 
tion begins by borrowing the concept of a parsing expression from PEGs. The 
abstract syntax of parsing expressions is given below: 

$$p = \varepsilon \mid a \mid A \mid p_1 p_2 \mid p_1 \mid p_2$$

Parsing expressions are defined inductively as the empty expression $\varepsilon$, a
terminal symbol a, a non-terminal symbol A, a concatenation $p_1 p_2$ of two parsing
expressions $p_1$ and $p_2$, or a choice $p_1 \mid p_2$ between two parsing expressions $p_1$ and
$p_2$.

We now define a PE-CFG (short for CFG using parsing expressions) G as
a tuple $(V, T, P, p_S)$, where V and T are still the sets of non-terminals and
terminals, but P is now a function from non-terminals to parsing expressions,
and $p_S$ is the initial parsing expression of the grammar.

Instead of the relation $\Rightarrow_G$, we define a new relation, $\stackrel{\text{CFG}}{\leadsto}$, among a grammar
G, a string of terminal symbols v, and another string of terminal symbols w.
We will use the notation $G\ v \stackrel{\text{CFG}}{\leadsto} w$ to say that $(G, v, w) \in\ \stackrel{\text{CFG}}{\leadsto}$. The intuition
for the $\stackrel{\text{CFG}}{\leadsto}$ relation is that the first string is the input, and the second string is
a suffix of the input that is left after G matches a prefix of this input. We will
usually say $G\ xy \stackrel{\text{CFG}}{\leadsto} y$ to mean that G matches a prefix x of input string xy.

Figure 1 shows our semantics for $\stackrel{\text{CFG}}{\leadsto}$ using natural semantics, as a set of
inference rules. $G\ xy \stackrel{\text{CFG}}{\leadsto} y$ if and only if there is a finite proof tree for it, built
using these rules. The notation $G[p'_S]$ denotes a new grammar $(V, T, P, p'_S)$ that
is equal to G except for the initial parsing expression $p_S$, which is replaced by
$p'_S$. Each rule follows naturally from the intuition of $\stackrel{\text{CFG}}{\leadsto}$: an empty parsing
expression does not consume any input (empty.1); a terminal consumes itself
if it is the first symbol of the input (char.1); a non-terminal matches its
corresponding production in P (var.1); a concatenation first matches $p_1$ and then
matches $p_2$ with what is left of the input (con.1); and a choice can match
either $p_1$ or $p_2$ (choice.1 and choice.2, respectively). The rules guarantee that
if $G\ v \stackrel{\text{CFG}}{\leadsto} w$ then w is a suffix of v.
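The inference rules for the non-deterministic relation can be read as a recognizer that collects every possible remaining suffix. The Python sketch below is one illustrative reading, not part of the original formalism: the tuple encoding of parsing expressions and the name `match_cfg` are assumptions made here, and the function only terminates for grammars without left recursion.

```python
# Hypothetical encoding of parsing expressions as tuples:
#   ("eps",)  ("chr", "a")  ("var", "A")  ("seq", p1, p2)  ("alt", p1, p2)
# P maps each non-terminal name to its parsing expression.

def match_cfg(P, p, s):
    """Return the set of all suffixes w of s such that G[p] s ~CFG~> w."""
    kind = p[0]
    if kind == "eps":                       # empty.1: consume nothing
        return {s}
    if kind == "chr":                       # char.1: consume one matching symbol
        return {s[1:]} if s.startswith(p[1]) else set()
    if kind == "var":                       # var.1: use the expression P(A)
        return match_cfg(P, P[p[1]], s)
    if kind == "seq":                       # con.1: match p1, then p2 on the rest
        return {w for rest in match_cfg(P, p[1], s)
                  for w in match_cfg(P, p[2], rest)}
    if kind == "alt":                       # choice.1 and choice.2: either side
        return match_cfg(P, p[1], s) | match_cfg(P, p[2], s)
    raise ValueError(f"unknown expression kind: {kind}")

# Example grammar (an assumption for illustration): S -> a S | eps
P = {"S": ("alt", ("seq", ("chr", "a"), ("var", "S")), ("eps",))}
suffixes = match_cfg(P, ("var", "S"), "aab")
print(sorted(suffixes))  # prints ['aab', 'ab', 'b']
```

Every returned string is a suffix of the input, which mirrors the guarantee stated for the rules; an empty result set means that no proof tree exists for that expression and input.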

The language of G, $L(G)$, is now the set of prefixes that G matches, that is,
all strings x where $G\ xy \stackrel{\text{CFG}}{\leadsto} y$ for some string y. In the traditional definition of
CFGs, the language of a grammar is the set of strings the grammar generates;
in our new definition, the language is the set of strings the grammar matches.
We could have defined the language of G as the set of strings x where $G\ x \stackrel{\text{CFG}}{\leadsto} \varepsilon$,
that is, the set of strings that G matches completely, but it is a corollary of the
following lemma that the two definitions are equivalent:

Lemma 2.1. Given a PE-CFG G, if $G[p]\ xy \stackrel{\text{CFG}}{\leadsto} y$ then we have
$\forall y'.\ G[p]\ xy' \stackrel{\text{CFG}}{\leadsto} y'$.

Proof. By induction on the height of the proof tree for $G[p]\ xy \stackrel{\text{CFG}}{\leadsto} y$. $\square$

The previous lemma shows that the suffix in the relation $\stackrel{\text{CFG}}{\leadsto}$ is superfluous;
we could have defined relation $\stackrel{\text{CFG}}{\leadsto}$ as a binary relation between a grammar G
and an input w, meaning just that G recognizes w. We chose to keep the suffix to
emphasize the similarities between this semantics and our semantics for PEGs,
where the suffix matters.

We need a way to systematically transform a traditional CFG G to a corresponding
PE-CFG G', and vice versa. To simplify our proofs, we will assume
that grammars do not have useless symbols. The main obstacle for these trans- 
formations is the type of P. P is a relation for CFGs, with different productions 
for the same non-terminal being different entries in this relation. In PE-CFGs, 
however, P is a function, with all the different productions encoded as choices 
in the parsing expression for the non-terminal. 

The choice operator is commutative, associative, and idempotent; both left
and right concatenations distribute over choice, that is, $p_1 (p_2 \mid p_3) = p_1 p_2 \mid p_1 p_3$
and $(p_1 \mid p_2) p_3 = p_1 p_3 \mid p_2 p_3$. So any parsing expression may be rewritten as
a choice $p_1 \mid \ldots \mid p_n$, where the subexpressions $p_1, \ldots, p_n$ are distinct and do
not have choice operators. We can then go from a PE-CFG G' to a CFG G
using $A \rightarrow p_1, \ldots, A \rightarrow p_n$ as the productions of each non-terminal A, where
$p_1, \ldots, p_n$ are the subexpressions obtained by rewriting the expression $P'(A)$ in
the way above.

Going from a CFG G to a PE-CFG G' is easier: the right side of each production
of G is a concatenation of non-terminals and terminals, which translates
directly to a concatenation of parsing expressions (the concatenation of
expressions is associative); we assign an arbitrary order to the productions of each
non-terminal A of G, and then combine these productions right-associatively
into a choice expression, and this is $P'(A)$. We will call the transformation of
CFGs to PE-CFGs $\mathcal{T}$, so $\mathcal{T}(G) = G'$.

As an example, take the CFG G with the following set of productions:

$$P = \{A \rightarrow BC,\ B \rightarrow a,\ B \rightarrow b,\ C \rightarrow c,\ C \rightarrow d,\ C \rightarrow e\}$$

Its corresponding PE-CFG $\mathcal{T}(G) = G'$ has the following definition for the
function P':

$$P'(A) = BC \qquad P'(B) = a \mid b \qquad P'(C) = c \mid d \mid e$$

We used the order in which we listed the productions of G to order the choices,
but commutativity and associativity of the choice operator guarantee that any
other order would yield a grammar with the same language as G', so we could
have used the following definition for P' instead:

$$P'(A) = BC \qquad P'(B) = a \mid b \qquad P'(C) = e \mid c \mid d$$
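The transformation from a CFG to a PE-CFG described above can be sketched in a few lines of Python. This is only an illustration under assumptions introduced here: productions are given as `(lhs, rhs)` pairs with `rhs` a list of symbols, parsing expressions are encoded as tuples, and the name `to_pe_cfg` is hypothetical. The listing order of the productions is used as the arbitrary order of the alternatives.

```python
def to_pe_cfg(productions, nonterminals):
    """Build the function P' of a PE-CFG from CFG productions (a sketch)."""
    def symbol(sym):
        return ("var", sym) if sym in nonterminals else ("chr", sym)

    def concat(rhs):
        # An empty right side becomes the empty expression.
        if not rhs:
            return ("eps",)
        expr = symbol(rhs[-1])
        for sym in reversed(rhs[:-1]):      # right-associative concatenation
            expr = ("seq", symbol(sym), expr)
        return expr

    alternatives = {}
    for lhs, rhs in productions:            # keep the listing order
        alternatives.setdefault(lhs, []).append(concat(rhs))

    P = {}
    for lhs, alts in alternatives.items():
        expr = alts[-1]
        for alt in reversed(alts[:-1]):     # right-associative choice
            expr = ("alt", alt, expr)
        P[lhs] = expr
    return P

# The example grammar from the text.
productions = [("A", ["B", "C"]), ("B", ["a"]), ("B", ["b"]),
               ("C", ["c"]), ("C", ["d"]), ("C", ["e"])]
P_prime = to_pe_cfg(productions, {"A", "B", "C"})
print(P_prime["C"])
```

Reordering the productions in the input list yields a different but, under the non-deterministic semantics, language-equivalent choice expression.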

The proof that G and $\mathcal{T}(G)$ define the same language for any CFG G is a
direct corollary of the following lemma:

Lemma 2.2. Given a CFG G and its corresponding PE-CFG $\mathcal{T}(G) = G'$, we
have $\alpha \Rightarrow_G^* x$ if and only if $G'[\alpha]\ xy \stackrel{\text{CFG}}{\leadsto} y$, where x is a string of terminals and
$\alpha$ is a string of terminals and non-terminals.

Proof. ($\Rightarrow$) By induction on the number of steps in the derivation of x. The
base case, where $\alpha = x$, is trivial, with an application of the empty.1 rule or
repeated applications of the con.1 and char.1 rules.

The induction step has $\alpha$ composed of three parts: a prefix $\alpha'$, a non-terminal
A, and a suffix $\gamma$, with $\alpha' A \gamma \Rightarrow_G \alpha' \beta \gamma \Rightarrow_G^* x$. By the properties of $\Rightarrow_G$, x can
be decomposed into $x_1$, $x_2$ and $x_3$ with $\alpha' \Rightarrow_G^* x_1$, $\beta \Rightarrow_G^* x_2$, and $\gamma \Rightarrow_G^* x_3$.
By the induction hypothesis we have $G'[\alpha']\ x_1 x_2 x_3 y \stackrel{\text{CFG}}{\leadsto} x_2 x_3 y$,
$G'[\beta]\ x_2 x_3 y \stackrel{\text{CFG}}{\leadsto} x_3 y$, and $G'[\gamma]\ x_3 y \stackrel{\text{CFG}}{\leadsto} y$. We combine these proof trees in a proof tree for
$G'[\alpha' A \gamma]\ xy \stackrel{\text{CFG}}{\leadsto} y$ with rules con.1, var.1, and applications of the choice rules
to select the alternative corresponding to production $A \rightarrow \beta$.

($\Leftarrow$) By induction on the height of the proof tree for $G'[\alpha]\ xy \stackrel{\text{CFG}}{\leadsto} y$. The
interesting case is var.1; we need to use the fact that the use of choice operators
in G' follows a known structure, where each production is a right-associative
choice of parsing expressions that do not have choice operators and correspond
to the right side of productions in G. So the proof tree for $G'[P'(A)]\ xy \stackrel{\text{CFG}}{\leadsto} y$
ends with a succession of choice rules that select which of the alternatives is
taken for that non-terminal. We apply the induction hypothesis to the subtree
above the last choice rule used, from the consequent to the antecedents. $\square$

A corollary of Lemma 2.2 is that $S \Rightarrow_G^* x$ if and only if $\mathcal{T}(G)\ xy \stackrel{\text{CFG}}{\leadsto} y$, so
the language of G and the language of $\mathcal{T}(G)$ are the same.



A traditional CFG is ambiguous if and only if there is some string with more 
than one leftmost (or rightmost) derivation. We can define ambiguity for
PE-CFGs via proof trees: a PE-CFG G is ambiguous if and only if there is more
than one proof tree for $G\ xy \stackrel{\text{CFG}}{\leadsto} y$ for some x and y.

We can show that a CFG G is ambiguous if and only if its corresponding
PE-CFG $\mathcal{T}(G)$ is ambiguous. The proof is a corollary of the proposition that
there is only one leftmost derivation for $\alpha \Rightarrow_G^* x$ if and only if there is only one
proof tree for $G'[\alpha]\ xy \stackrel{\text{CFG}}{\leadsto} y$, where x is a string of terminals and $\alpha$ is a string
of non-terminals and terminals. This proposition has a straightforward proof
by induction (on the number of steps in the derivation and on the height of the
proof tree), and the corollary follows by denial of the consequent.

Ambiguity, in our semantics, is directly tied to the choice operator: if we
try to prove that there cannot be more than one proof tree for a $G[p]\ xy \stackrel{\text{CFG}}{\leadsto} y$,
by induction on the height of the tree, our proof fails for case choice.1,
because even if there is only one proof tree for $G[p_1]\ xy \stackrel{\text{CFG}}{\leadsto} y$, we might
have $G[p_2]\ xy \stackrel{\text{CFG}}{\leadsto} y$, so we can get another proof tree for $G[p_1 \mid p_2]\ xy \stackrel{\text{CFG}}{\leadsto} y$ by
using choice.2. The proof fails for case choice.2 in a similar way.

If we can change the semantics of choice so that a single proof tree for its
antecedents guarantees a single proof tree for the choice then we will guarantee
that all grammars will be unambiguous. Obviously we will not have CFGs
anymore; in particular, we will invalidate Lemma 2.2. In fact, our changes will
take us from CFGs to a restricted form of PEGs, and we will prove that our
changed semantics is equivalent to the semantics of PEGs as defined by Ford [1].

We will make the choice operator ordered: in a choice $p_1 \mid p_2$ we try $p_2$ only
if $p_1$ does not match. But we need a way to have a proof tree for "$p_1$ does not
match", so we will also introduce an explicit failure result, fail, to indicate
the cases where a match is not possible. We will combine these changes in the
semantics of a new relation $\stackrel{\text{PEG}}{\leadsto}$. Figure 2 lists its inference rules, where X means
either fail or the remainder of the input string in a successful match.

Just introducing fail does not change the semantics enough to be
incompatible with regular CFGs; if we take the semantics of Figure 2 and replace rule
ord.2 with choice.2 then we have a conservative extension of our PE-CFG
semantics that introduces fail, so all our previous proofs remain valid. Ordered
choice, represented by rule ord.2, is what changes the semantics so it is not
representing CFGs anymore. A simple example that shows this change is the
grammar G below:

$$S \rightarrow AB \qquad A \rightarrow aba \mid a \qquad B \rightarrow b$$

We have $G\ abac \stackrel{\text{CFG}}{\leadsto} ac$, but $G\ abac \not\stackrel{\text{PEG}}{\leadsto} ac$, as the only proof tree under $\stackrel{\text{PEG}}{\leadsto}$
for the input string abac is for $G\ abac \stackrel{\text{PEG}}{\leadsto} \texttt{fail}$.
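The effect of ordered choice on this grammar can be checked with a small deterministic matcher. The Python sketch below is an illustration only: the tuple encoding of expressions, the name `match_peg`, and the use of `None` as the failure result are assumptions made here, and the function assumes a grammar without left recursion.

```python
FAIL = None  # hypothetical stand-in for the explicit failure result

def match_peg(P, p, s):
    """Return the unique remaining suffix for G[p] s under ordered choice, or FAIL."""
    kind = p[0]
    if kind == "eps":                       # empty.1
        return s
    if kind == "chr":                       # char rules: succeed or fail
        return s[1:] if s.startswith(p[1]) else FAIL
    if kind == "var":                       # var.1
        return match_peg(P, P[p[1]], s)
    if kind == "seq":                       # concatenation fails if either part fails
        rest = match_peg(P, p[1], s)
        return FAIL if rest is FAIL else match_peg(P, p[2], rest)
    if kind == "alt":                       # p2 is tried only when p1 fails
        rest = match_peg(P, p[1], s)
        return rest if rest is not FAIL else match_peg(P, p[2], s)
    raise ValueError(f"unknown expression kind: {kind}")

# The grammar from the text: S -> AB, A -> aba | a, B -> b
a, b = ("chr", "a"), ("chr", "b")
P = {
    "S": ("seq", ("var", "A"), ("var", "B")),
    "A": ("alt", ("seq", a, ("seq", b, a)), a),
    "B": b,
}
print(match_peg(P, ("var", "S"), "abac"))   # prints None: A commits to "aba", then B fails on "c"
print(repr(match_peg(P, ("var", "S"), "ab")))  # prints '': "aba" fails, "a" is tried, B matches "b"
```

Because the first alternative of A succeeds on abac, the second alternative is never revisited when B later fails; this is the limited backtracking that separates the ordered semantics from the non-deterministic one.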

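We will use $L^{\text{PEG}}(G)$ for the language of a PE-CFG G interpreted with $\stackrel{\text{PEG}}{\leadsto}$; as
with $\stackrel{\text{CFG}}{\leadsto}$, this is the set of strings x for which there is a string y with $G\ xy \stackrel{\text{PEG}}{\leadsto} y$.
Informally, this set is still the set of all the prefixes that G matches, only using
$\stackrel{\text{PEG}}{\leadsto}$ instead of $\stackrel{\text{CFG}}{\leadsto}$. But there is no equivalent of Lemma 2.1 for $\stackrel{\text{PEG}}{\leadsto}$; for example,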



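$$\text{Empty}\quad \frac{}{G[\varepsilon]\ x \stackrel{\text{PEG}}{\leadsto} x}\ \text{(empty.1)} \qquad \text{Non-terminal}\quad \frac{G[P(A)]\ x \stackrel{\text{PEG}}{\leadsto} X}{G[A]\ x \stackrel{\text{PEG}}{\leadsto} X}\ \text{(var.1)}$$

$$\text{Terminal}\quad \frac{}{G[a]\ ax \stackrel{\text{PEG}}{\leadsto} x}\ \text{(char.1)} \qquad \frac{b \neq a}{G[b]\ ax \stackrel{\text{PEG}}{\leadsto} \texttt{fail}}\ \text{(char.2)} \qquad \frac{}{G[a]\ \varepsilon \stackrel{\text{PEG}}{\leadsto} \texttt{fail}}\ \text{(char.3)}$$

$$\text{Concatenation}\quad \frac{G[p_1]\ xy \stackrel{\text{PEG}}{\leadsto} y \qquad G[p_2]\ y \stackrel{\text{PEG}}{\leadsto} X}{G[p_1 p_2]\ xy \stackrel{\text{PEG}}{\leadsto} X}\ \text{(con.1)} \qquad \frac{G[p_1]\ x \stackrel{\text{PEG}}{\leadsto} \texttt{fail}}{G[p_1 p_2]\ x \stackrel{\text{PEG}}{\leadsto} \texttt{fail}}\ \text{(con.2)}$$

$$\text{Ordered Choice}\quad \frac{G[p_1]\ xy \stackrel{\text{PEG}}{\leadsto} y}{G[p_1 \mid p_2]\ xy \stackrel{\text{PEG}}{\leadsto} y}\ \text{(ord.1)} \qquad \frac{G[p_1]\ xy \stackrel{\text{PEG}}{\leadsto} \texttt{fail} \qquad G[p_2]\ xy \stackrel{\text{PEG}}{\leadsto} y}{G[p_1 \mid p_2]\ xy \stackrel{\text{PEG}}{\leadsto} y}\ \text{(ord.2)}$$

$$\frac{G[p_1]\ x \stackrel{\text{PEG}}{\leadsto} \texttt{fail} \qquad G[p_2]\ x \stackrel{\text{PEG}}{\leadsto} \texttt{fail}}{G[p_1 \mid p_2]\ x \stackrel{\text{PEG}}{\leadsto} \texttt{fail}}\ \text{(ord.3)}$$

Figure 2: Natural semantics of $\stackrel{\text{PEG}}{\leadsto}$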



the grammar above matches ab but fails for abac.

Properties of the operators also change under $\stackrel{\text{PEG}}{\leadsto}$: the choice operator is not
commutative, and concatenation does not distribute over choice on the right
anymore (although it still distributes on the left).

In Section 3 we will show a class of PE-CFGs where $L(G) = L^{\text{PEG}}(G)$. For
now, an interesting result is the following lemma, which proves that $L^{\text{PEG}}(G)$ is
a subset of $L(G)$ for any PE-CFG G:

Lemma 2.3. Given a PE-CFG G, if $G[p]\ xy \stackrel{\text{PEG}}{\leadsto} y$ then we have $G[p]\ xy \stackrel{\text{CFG}}{\leadsto} y$.

Proof. By induction on the height of the proof tree for $G[p]\ xy \stackrel{\text{PEG}}{\leadsto} y$. The only
rule that does not have an identical rule in $\stackrel{\text{CFG}}{\leadsto}$ is ord.2, but it can trivially be
replaced by choice.2. $\square$

The intuition of $G\ x \stackrel{\text{PEG}}{\leadsto} \texttt{fail}$ is that G does not match any prefix of x
(including the empty string). This is a corollary of the following lemma, which
formally says that the result of $G[p]\ x$ is unique under $\stackrel{\text{PEG}}{\leadsto}$, for any G, p and x:






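Lemma 2.4. Given a PE-CFG G, if $G[p]\ x \stackrel{\text{PEG}}{\leadsto} X$ and $G[p]\ x \stackrel{\text{PEG}}{\leadsto} X'$ then we
have $X = X'$, and there is only one proof tree for $G[p]\ x \stackrel{\text{PEG}}{\leadsto} X$.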



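$$\text{Not Predicate}\quad \frac{G[p]\ x \stackrel{\text{PEG}}{\leadsto} \texttt{fail}}{G[!p]\ x \stackrel{\text{PEG}}{\leadsto} x}\ \text{(not.1)} \qquad \frac{G[p]\ xy \stackrel{\text{PEG}}{\leadsto} y}{G[!p]\ xy \stackrel{\text{PEG}}{\leadsto} \texttt{fail}}\ \text{(not.2)}$$

Figure 3: Natural semantics of the not predicate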






Proof. By induction on the height of the proof tree for $G[p]\ x \stackrel{\text{PEG}}{\leadsto} X$. The
interesting cases are ord.1 and ord.2; for ord.1, the induction hypothesis rules
out the possibility of $G[p_1]\ x \stackrel{\text{PEG}}{\leadsto} \texttt{fail}$, so ord.2 cannot apply even if we have
$G[p_2]\ x \stackrel{\text{PEG}}{\leadsto} X$. For ord.2, we must have $G[p_1]\ x \stackrel{\text{PEG}}{\leadsto} \texttt{fail}$ by the induction
hypothesis; even if X is fail we cannot use rule ord.1. $\square$

The PE-CFGs that we will be dealing with in the rest of the paper will have
an important property that becomes possible to express by introducing failure:
they will be complete grammars [1]. A complete PE-CFG is one where for any
expression p and any input x either $G[p]\ x \stackrel{\text{PEG}}{\leadsto} x'$ or $G[p]\ x \stackrel{\text{PEG}}{\leadsto} \texttt{fail}$. Ford [1]
proves that any grammar that does not have direct or indirect left recursion (a
property which can be structurally checked) is complete.

A PE-CFG G is left-recursive when there is a non-terminal A of G and an
input x where trying to derive a proof tree for $G[A]\ x$ can make $G[A]\ x$ appear
again higher up in the tree. Because the semantics of $\stackrel{\text{PEG}}{\leadsto}$ is deterministic this
means that a left-recursive PE-CFG may not have any proof tree for $G[A]\ x$
under $\stackrel{\text{PEG}}{\leadsto}$; in this case, an implementation of PEGs that tries to match x with
the expression $G[A]$ will not terminate.

Once we have failure and unambiguity, it is natural to introduce a way to turn
a failure into a success. This is the not syntactic predicate ($!p$ for any parsing
expression p), which is a conservative extension of our semantics described in
Figure 3.
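The two rules of the not predicate amount to running the inner expression and inverting its outcome without consuming input. The following Python sketch illustrates this under assumptions introduced here: the tuple encoding (including the `("not", p)` form), the name `match_peg`, and `None` as the failure result are all hypothetical choices, and left recursion is assumed absent.

```python
FAIL = None  # hypothetical stand-in for the explicit failure result

def match_peg(P, p, s):
    """Ordered-choice matcher extended with the not predicate (a sketch)."""
    kind = p[0]
    if kind == "eps":
        return s
    if kind == "chr":
        return s[1:] if s.startswith(p[1]) else FAIL
    if kind == "var":
        return match_peg(P, P[p[1]], s)
    if kind == "seq":
        rest = match_peg(P, p[1], s)
        return FAIL if rest is FAIL else match_peg(P, p[2], rest)
    if kind == "alt":
        rest = match_peg(P, p[1], s)
        return rest if rest is not FAIL else match_peg(P, p[2], s)
    if kind == "not":
        # not.1: inner failure becomes success with the input untouched;
        # not.2: inner success becomes failure.
        return s if match_peg(P, p[1], s) is FAIL else FAIL
    raise ValueError(f"unknown expression kind: {kind}")

# "a" not followed by "b"
a_not_b = ("seq", ("chr", "a"), ("not", ("chr", "b")))
print(match_peg({}, a_not_b, "ac"))   # prints c
print(match_peg({}, a_not_b, "ab"))   # prints None

# The "and" predicate as sugar: a double negation checks for "a" without consuming it.
and_a = ("not", ("not", ("chr", "a")))
print(match_peg({}, and_a, "ab"))     # prints ab
```

The double negation in the last example is how the and predicate mentioned in the introduction can be treated as syntactic sugar.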

PE-CFGs extended with the not-predicate and interpreted using $\stackrel{\text{PEG}}{\leadsto}$ are
equivalent, syntactically as well as semantically, to PEGs. Ford [1] also
included the repetition operator in the abstract syntax of PEGs, but eliminating
repetition is a simple matter of replacing each repetition expression $p*$ with a
new non-terminal $A_p$ with the production $A_p \rightarrow p A_p \mid \varepsilon$, which is just a step up
from simple syntactic sugar.
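The elimination of repetition can be sketched as a small rewriting pass. The Python code below is an illustration under assumptions made here: expressions are tuples, repetition is encoded as `("star", p)`, the function name `desugar` is hypothetical, and the generated names `_rep1`, `_rep2`, ... are assumed not to clash with existing non-terminals.

```python
def desugar(P):
    """Return a copy of P where each ("star", p) is replaced by a fresh non-terminal."""
    out = {}
    counter = [0]

    def walk(p):
        kind = p[0]
        if kind == "star":
            body = walk(p[1])
            counter[0] += 1
            name = f"_rep{counter[0]}"      # fresh non-terminal A_p (assumed unused)
            # A_p -> p A_p | eps
            out[name] = ("alt", ("seq", body, ("var", name)), ("eps",))
            return ("var", name)
        if kind in ("seq", "alt"):
            return (kind, walk(p[1]), walk(p[2]))
        if kind == "not":
            return ("not", walk(p[1]))
        return p                            # eps, chr, var are left unchanged

    for name, expr in P.items():
        out[name] = walk(expr)
    return out

# Example (an assumption for illustration): S -> a* b
P = {"S": ("seq", ("star", ("chr", "a")), ("chr", "b"))}
for name, expr in desugar(P).items():
    print(name, expr)
```

The order of the alternatives in the generated production matters under ordered choice: placing the empty expression first would make the repetition match nothing.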

Ford [1] defines the semantics of PEGs using a relation $\Rightarrow_G$ that is similar
to $\stackrel{\text{PEG}}{\leadsto}$. Unlike the relation $\Rightarrow_G$ for traditional CFGs, Ford's $\Rightarrow_G$ is not a single
step in the match, but the whole match. The notation $(p, x) \Rightarrow_G (n, X)$, for
$(p, x, n, X) \in\ \Rightarrow_G$, means that either the parsing expression p matches the prefix
x' of input x, if X is x', or the match fails, if X is fail. The number n is a
step counter, used in proofs by induction involving $\Rightarrow_G$.

Ford's definition of relation $\Rightarrow_G$ is similar to our definition of the $\stackrel{\text{PEG}}{\leadsto}$ relation,
using a similar set of cases. The proof that $(p, xy) \Rightarrow_G (n, x)$ if and only
if $G[p]\ xy \stackrel{\text{PEG}}{\leadsto} y$ and $(p, xy) \Rightarrow_G (n, \texttt{fail})$ if and only if $G[p]\ xy \stackrel{\text{PEG}}{\leadsto} \texttt{fail}$ is
straightforward, by induction on the step count n and on the height of the
proof tree for $G[p]\ xy \stackrel{\text{PEG}}{\leadsto} X$. Our definition for the language of a PEG is
different from Ford's, though. Ford defines $L^{\text{PEG}}$ as the set of strings for which
a PEG recognizes some prefix of the string, while we use the set of strings that
the PEG recognizes. In particular, the language of $\varepsilon$ is $T^*$ by Ford's definition
and $\{\varepsilon\}$ with ours.

Ordered choice is what makes PEGs essentially different from CFGs, though,
so one way to go from a CFG that can be parsed top-down to a PEG that parses
the same language, without changing the structure of the grammar, is to make
sure that the PEG always chooses the correct alternative at each choice. In
Sections 3, 4 and 5 we show how this intuition leads to translations from three
classes of CFGs for top-down parsing, LL(1), strong-LL(k), and LL-regular, to
equivalent PEGs.

3. LL(1) Grammars and PEGs 

LL(1) grammars are the subset of CFGs where a top-down parser can decide
which production to use for a non-terminal by examining just the next symbol 
of the input. An LL(1) parser can then parse the whole input by starting with 
the initial non-terminal of the grammar and then choosing which production to 
apply, making a choice again whenever it encounters a non-terminal, without 
needing to backtrack on its choices. A correspondence between them and PEGs 
has already been noted [6], but not formally proven, so they are a nice starting
point for applying our new semantics of CFGs and PEGs to the task of finding 
translations from subsets of CFGs to corresponding PEGs. 

We will divide this task into two parts: first we will consider LL(1) grammars
without $\varepsilon$ expressions and show that there is a correspondence between these
grammars and PEGs. Then we will consider grammars with $\varepsilon$ expressions and
show that there is a correspondence between these grammars and PEGs if the
ordering of the choice expressions respects a simple property.

In Section 2 we presented a method of translating a traditional CFG to a
PE-CFG, a CFG using parsing expressions. That method generates PE-CFGs
with a property that will be useful in the proofs for this section; we will formalize
this property with the following definition:

BNF structure. A PE-CFG $G = (V, T, P, p_S)$ has BNF structure if it obeys
the following properties:

1. No choice expression of G is part of a concatenation expression;

2. $p_S$ is a single non-terminal;

3. For every choice $p_1 \mid p_2$ of G, if $p_1$ matches the empty string then $p_2$
must also match the empty string.

Any traditional CFG G has a corresponding PE-CFG G' that has BNF structure;
in particular, it is trivial to ensure that $\mathcal{T}(G)$ always has BNF structure.
Properties 1 and 2 of BNF structure are an obvious outcome of the transformation
$\mathcal{T}$: the expression associated to each non-terminal is of the form $p_1 \mid \ldots \mid p_n$,
where the choices associate to the right and $p_1, \ldots, p_n$ do not have choice
expressions, so property 1 applies; the initial expression of G' is the initial non-terminal
of G, so property 2 also applies; finally, property 3 can be guaranteed by choosing
an order for the productions of each non-terminal of G so the productions that
can generate the empty string are last.

Any PE-CFG without BNF structure also can be rewritten to have it, by
distributivity of concatenation over choice on the left and on the right,
associativity and commutativity of choice, and the addition of an extra non-terminal
to be the start expression. Throughout the rest of this section we will only
consider PE-CFGs that have BNF structure in our definitions and proofs.

A traditional CFG without $\varepsilon$ productions is LL(1) if and only if, for each of
its non-terminals $A_i$, the FIRST sets for the right sides of the productions of
$A_i$ are disjoint. Before we can give a definition for LL(1) PE-CFGs, we need to
define what is the FIRST set of a parsing expression. We will use the following
definition for the FIRST set of an expression p with a PE-CFG G:

$$\text{FIRST}^G(p) = \{a \in T \mid G[p]\ axy \stackrel{\text{CFG}}{\leadsto} y\}$$

This definition of FIRST is equivalent to the definition for traditional CFGs
for any parsing expression that does not have choice operators (that is, any
parsing expression that has a corresponding string of terminals and non-terminals).
We can use the PE-CFG to CFG equivalence lemma (Lemma 2.2) to conclude
that $p \Rightarrow_G^* ax$ from $\mathcal{T}(G)[p]\ axy \stackrel{\text{CFG}}{\leadsto} y$, where G is a traditional CFG. The
FIRST set of p in the traditional definition is the set $\{a \in T \mid p \Rightarrow_G^* a\beta\}$. As we
assumed in Section 2 that G does not have useless symbols, this is the same as
the set $\{a \in T \mid p \Rightarrow_G^* ax\}$, and the two definitions of FIRST are equivalent.

We can now give a definition for LL(1) PE-CFGs without $\varepsilon$ expressions: a
PE-CFG G without $\varepsilon$ expressions is LL(1) if and only if, for every choice $p_1 \mid p_2$
in the grammar, the FIRST sets of $p_1$ and $p_2$ are disjoint.

It is straightforward to prove that a traditional CFG G without $\varepsilon$ productions
is LL(1) if and only if its corresponding PE-CFG $\mathcal{T}(G)$ is also LL(1). The proof
uses property 1 of BNF structure, associativity of choice, and the property that
$\text{FIRST}^{\mathcal{T}(G)}(p_1 \mid p_2) = \text{FIRST}^{\mathcal{T}(G)}(p_1) \cup \text{FIRST}^{\mathcal{T}(G)}(p_2)$.
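The LL(1) condition for PE-CFGs without ε expressions can be checked mechanically. The Python sketch below is a structural approximation of the semantic definition of FIRST given above, and it relies on assumptions stated here rather than in the text: the grammar has no ε expressions, no left recursion, and no useless symbols (so the first symbol of a concatenation always comes from its first part); the tuple encoding and the names `first`, `choices` and `is_ll1` are hypothetical.

```python
def first(P, p):
    """FIRST set of p, assuming no eps expressions and no left recursion."""
    kind = p[0]
    if kind == "chr":
        return {p[1]}
    if kind == "var":
        return first(P, P[p[1]])
    if kind == "seq":                       # p1 cannot match the empty string
        return first(P, p[1])
    if kind == "alt":                       # FIRST(p1 | p2) = FIRST(p1) U FIRST(p2)
        return first(P, p[1]) | first(P, p[2])
    raise ValueError("eps expressions are outside the scope of this sketch")

def choices(p):
    """Yield every choice subexpression of p."""
    if p[0] == "alt":
        yield p
    if p[0] in ("seq", "alt"):
        yield from choices(p[1])
        yield from choices(p[2])

def is_ll1(P):
    """True when the two sides of every choice have disjoint FIRST sets."""
    return all(first(P, c[1]).isdisjoint(first(P, c[2]))
               for expr in P.values() for c in choices(expr))

a, b, c, d, e = (("chr", ch) for ch in "abcde")
# A = BC, B = a | b, C = c | d | e
P1 = {"A": ("seq", ("var", "B"), ("var", "C")),
      "B": ("alt", a, b),
      "C": ("alt", c, ("alt", d, e))}
# S = AB, A = aba | a, B = b
P2 = {"S": ("seq", ("var", "A"), ("var", "B")),
      "A": ("alt", ("seq", a, ("seq", b, a)), a),
      "B": b}
print(is_ll1(P1), is_ll1(P2))  # prints True False
```

The second grammar fails the check because both alternatives of A start with a, which is also the grammar where the ordered and non-deterministic semantics were shown to disagree.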

Now that we have a definition for LL(1) PE-CFGs without $\varepsilon$ expressions, we
can show that these grammars can be interpreted as PEGs (by the relation $\stackrel{\text{PEG}}{\leadsto}$)
without changing their language. The assertion that the language of an LL(1)
grammar is the same whether interpreted as a CFG or as a PEG is a corollary
of the following lemma:

Lemma 3.1. Given an LL(1) PE-CFG G without $\varepsilon$ expressions, $G[p]\ xy \stackrel{\text{CFG}}{\leadsto} y$
if and only if $G[p]\ xy \stackrel{\text{PEG}}{\leadsto} y$.

Proof. ($\Rightarrow$) By induction on the height of the proof tree for $G[p]\ xy \stackrel{\text{CFG}}{\leadsto} y$.
The interesting case is choice.2. For this case, we have $G[p_2]\ xy \stackrel{\text{CFG}}{\leadsto} y$. As
G does not have $\varepsilon$ expressions, x cannot be empty. Let a be the first symbol
of x. It is obvious that $a \in \text{FIRST}^G(p_2)$, so $a \notin \text{FIRST}^G(p_1)$ by the LL(1)
property. So $G[p_1]\ xy \not\stackrel{\text{CFG}}{\leadsto} w$, and, by denial of the consequent of Lemma 2.3,
$G[p_1]\ xy \not\stackrel{\text{PEG}}{\leadsto} w$. LL(1) grammars cannot have left recursion [7], so they are
complete and $G[p_1]\ xy \not\stackrel{\text{PEG}}{\leadsto} w$ implies $G[p_1]\ xy \stackrel{\text{PEG}}{\leadsto} \texttt{fail}$. With the induction
hypothesis and the application of ord.2 we have $G[p_1 \mid p_2]\ xy \stackrel{\text{PEG}}{\leadsto} y$.

($\Leftarrow$) Just a special case of Lemma 2.3. $\square$

We will now show that a correspondence between LL(1) grammars and their
corresponding PEGs still exists when we allow $\varepsilon$ expressions, as long as the
LL(1) grammars have BNF structure. Grammars with $\varepsilon$ expressions can have $\varepsilon$
in the FIRST sets of their expressions, so we need a slightly different definition
of FIRST:

$$\text{FIRST}^G(p) = \{a \in T \mid G[p]\ axy \stackrel{\text{CFG}}{\leadsto} y\} \cup \text{nullable}(p)$$

$$\text{nullable}(p) = \begin{cases} \{\varepsilon\} & \text{if } G[p]\ x \stackrel{\text{CFG}}{\leadsto} x \\ \emptyset & \text{otherwise} \end{cases}$$

The LL(1) property for grammars with $\varepsilon$ expressions also uses a FOLLOW
set, defined below:

$$\text{FOLLOW}^G(A) = \{a \in T \cup \{\$\} \mid G[A]\ yaz \stackrel{\text{CFG}}{\leadsto} az \text{ is in a proof tree for } G\ w\$ \stackrel{\text{CFG}}{\leadsto} \$\}$$

Like with the FIRST set, it is straightforward to prove that our definition of
FOLLOW is equivalent to the definition for traditional CFGs. The restriction
involving the proof tree for $G\ w\$ \stackrel{\text{CFG}}{\leadsto} \$$ of our definition proceeds directly from
the fact that the traditional definition only uses derivations starting from the
initial symbol of the grammar, and CFG derivations correspond to PE-CFG
proof trees.

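The general statement $G\ xy \stackrel{\text{CFG}}{\leadsto} y \Rightarrow G\ xy \stackrel{\text{PEG}}{\leadsto} y$ that we proved true for
LL(1) PE-CFGs without $\varepsilon$ expressions is false for grammars with $\varepsilon$ expressions,
as the following simple grammar shows:

$$S \rightarrow a \mid \varepsilon$$

This grammar is LL(1), and we have $G\ a \stackrel{\text{CFG}}{\leadsto} a$ through rule choice.2, but
$G\ a \not\stackrel{\text{PEG}}{\leadsto} a$, although simple inspection shows that the language of G is $\{a, \varepsilon\}$
whether interpreted as a CFG or as a PEG.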
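The behaviour of this grammar under the two semantics can be reproduced with two small matchers. The Python sketch below is illustrative only: the tuple encoding, the names `cfg` and `peg`, and `None` as the failure result are assumptions made here. The last loop appends the character `$` to each input, with `$` assumed not to be a terminal of the grammar, and compares the two semantics on whether only that trailing character is left.

```python
FAIL = None  # hypothetical stand-in for the explicit failure result

def cfg(P, p, s):
    """All suffixes left by p on s under non-deterministic choice."""
    kind = p[0]
    if kind == "eps":
        return {s}
    if kind == "chr":
        return {s[1:]} if s.startswith(p[1]) else set()
    if kind == "var":
        return cfg(P, P[p[1]], s)
    if kind == "seq":
        return {w for rest in cfg(P, p[1], s) for w in cfg(P, p[2], rest)}
    return cfg(P, p[1], s) | cfg(P, p[2], s)            # "alt"

def peg(P, p, s):
    """The unique suffix left by p on s under ordered choice, or FAIL."""
    kind = p[0]
    if kind == "eps":
        return s
    if kind == "chr":
        return s[1:] if s.startswith(p[1]) else FAIL
    if kind == "var":
        return peg(P, P[p[1]], s)
    rest = peg(P, p[1], s)
    if kind == "seq":
        return FAIL if rest is FAIL else peg(P, p[2], rest)
    return rest if rest is not FAIL else peg(P, p[2], s)  # "alt"

# S -> a | eps
P = {"S": ("alt", ("chr", "a"), ("eps",))}
S = ("var", "S")

print(sorted(cfg(P, S, "a")))  # prints ['', 'a']: choice.2 can leave "a" unconsumed
print(repr(peg(P, S, "a")))    # prints '': ordered choice commits to the first alternative

# Restricting attention to matches that leave only the end marker,
# the two semantics agree on these inputs.
for x in ["", "a", "aa"]:
    print(repr(x), "$" in cfg(P, S, x + "$"), peg(P, S, x + "$") == "$")
```

For the inputs shown, both semantics accept the empty string and a, and both reject aa, even though the relations themselves differ on the unmarked input a.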

We solve the above problem by introducing an end-of-input marker $\$$ ($\$ \notin T$),
and using this marker to constrain proof trees so we only consider trees that
consume the input and leave just the marker. Instead of trying to prove that
$G\ xy \stackrel{\text{CFG}}{\leadsto} y \Rightarrow G\ xy \stackrel{\text{PEG}}{\leadsto} y$, we will prove that $G\ x\$ \stackrel{\text{CFG}}{\leadsto} \$ \Rightarrow G\ x\$ \stackrel{\text{PEG}}{\leadsto} \$$, which
will still be enough to prove that G has the same language either interpreted as
a PE-CFG or as a PEG.

A PE-CFG G with $\varepsilon$ expressions is LL(1) if and only if the following two
restrictions hold for every production $A \rightarrow p$ of G and every choice $p_1 \mid p_2$ of p:

1. $\text{FIRST}^G(p_1) \cap \text{FIRST}^G(p_2) = \emptyset$

2. $\text{FIRST}^G(p_1) \cap \text{FOLLOW}^G(A) = \emptyset$ if $\varepsilon \in \text{FIRST}^G(p_2)$

This is a direct restatement of the LL(1) restrictions for traditional CFGs,
and it is straightforward to show that a CFG G is LL(1) if and only if its
corresponding PE-CFG G' is LL(1).

We can now show that an LL(1) PE-CFG G has the same language whether 
interpreted as a CFG or as a PEG. The proof is a corollary of the following 
lemma: 

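Lemma 3.2. Given an LL(1) PE-CFG $G$, if there is a proof tree for
$G\ x\$ \stackrel{\mathrm{CFG}}{\leadsto} \$$ then, for every subtree
$G[p]\ x'\$ \stackrel{\mathrm{CFG}}{\leadsto} x''\$$, we have that
$G[p]\ x'\$ \stackrel{\mathrm{PEG}}{\leadsto} x''\$$.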

Proof. By induction on the height of the proof tree for
$G[p]\ x'\$ \stackrel{\mathrm{CFG}}{\leadsto} x''\$$. The interesting case is
choice.2. For this case, we have $G[p_2]\ x'\$ \stackrel{\mathrm{CFG}}{\leadsto} x''\$$.
Because of BNF structure, this is a subtree of
$G[A]\ x'\$ \stackrel{\mathrm{CFG}}{\leadsto} x''\$$ for some non-terminal $A$.
By the definition of FOLLOW, the first symbol $a$ of $x''\$$ is in
$\mathit{FOLLOW}^G(A)$. We now have two subcases, one where $x' = x''$ and another
where $x' = bwx''$.

In the first subcase, we have $a \notin \mathit{FIRST}(p_1)$ by the second LL(1)
restriction. So $G[p_1]\ x''\$ \stackrel{\mathrm{CFG}}{\not\leadsto} y$ and, by denial
of the consequent of Lemma 2.3 and completeness of LL(1) grammars,
$G[p_1]\ x''\$ \stackrel{\mathrm{PEG}}{\leadsto} \mathtt{fail}$. With the induction
hypothesis and the application of ord.2 we have
$G[p_1 \mid p_2]\ x''\$ \stackrel{\mathrm{PEG}}{\leadsto} x''\$$.

In the second subcase, where $x' = bwx''$, we have $b \in \mathit{FIRST}(p_2)$, so
$b \notin \mathit{FIRST}(p_1)$ by the first LL(1) restriction. The rest of the proof
is similar to the first subcase. □

The proof that $G\ x\$ \stackrel{\mathrm{CFG}}{\leadsto} \$$ if and only if
$G\ x\$ \stackrel{\mathrm{PEG}}{\leadsto} \$$ for any LL(1) PE-CFG $G$ is now
trivial, from the above lemma and from Lemma 2.3.

In the next section we will show how any strong-LL($k$) grammar can be
translated to a PEG that recognizes the same language, while keeping the overall
structure of the grammar.

4. Strong-LL(k) Grammars and PEGs

Strong-LL($k$) grammars are a subset of CFGs where a top-down parser can
predict which production to use for a non-terminal just by examining the next
$k$ symbols of the input, where $k$ is arbitrary but fixed for each grammar. They
are a special case of LL($k$) grammars, in which the parser can use both the next
$k$ symbols of the input and the history of which productions it already picked
during parsing.

Unlike LL(1) grammars, there are strong-LL($k$) grammars that have different
languages when interpreted as CFGs and as PEGs, no matter how we order
their choice expressions. For example, take the PE-CFG $G$ with the following
productions:

$$S \rightarrow A \mid B \qquad A \rightarrow ab \mid C \qquad B \rightarrow a \mid Cd \qquad C \rightarrow c$$

$G$ is a strong-LL(2) grammar, and its language, when interpreted as a CFG,
is $\{a, ab, c, cd\}$. But interpreting $G$ as a PEG yields the language $\{a, ab, c\}$;
when matching $cd$, non-terminal $A$ succeeds (through its second alternative,
non-terminal $C$), and non-terminal $B$ (the second alternative of $S$) is never
tried. Changing $S$ to $S \rightarrow B \mid A$ changes the PEG's language to
$\{a, c, cd\}$, which is still different from the language of $G$ as a CFG, because
what happened to $cd$ now happens to $ab$.
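The three languages quoted for this example can be reproduced with a small sketch that reads the same grammar once with non-deterministic choice and once with ordered choice. The tuple encoding of expressions and the names `cfg_ends`, `peg_end`, and `language` are hypothetical, and a string is taken to be in the language when the whole string is consumed.

```python
from itertools import product

# Expressions: a str is a sequence of terminals ("" is the empty expression),
# ("nt", A) is a non-terminal, ("seq", p1, p2) a concatenation, ("alt", p1, p2) a choice.


def cfg_ends(g, p, s, i):
    """All positions where p may stop when matching s from i (non-deterministic choice)."""
    if isinstance(p, str):
        return {i + len(p)} if s.startswith(p, i) else set()
    if p[0] == "nt":
        return cfg_ends(g, g[p[1]], s, i)
    if p[0] == "seq":
        return {k for j in cfg_ends(g, p[1], s, i) for k in cfg_ends(g, p[2], s, j)}
    return cfg_ends(g, p[1], s, i) | cfg_ends(g, p[2], s, i)  # "alt"


def peg_end(g, p, s, i):
    """The position where p stops (ordered choice), or None on failure."""
    if isinstance(p, str):
        return i + len(p) if s.startswith(p, i) else None
    if p[0] == "nt":
        return peg_end(g, g[p[1]], s, i)
    if p[0] == "seq":
        j = peg_end(g, p[1], s, i)
        return None if j is None else peg_end(g, p[2], s, j)
    j = peg_end(g, p[1], s, i)  # "alt": try the second alternative only on failure
    return j if j is not None else peg_end(g, p[2], s, i)


def cfg_accepts(g, w):
    return len(w) in cfg_ends(g, g["S"], w, 0)


def peg_accepts(g, w):
    return peg_end(g, g["S"], w, 0) == len(w)


def language(g, accepts, alphabet="abcd", max_len=3):
    """Accepted strings among all strings over the alphabet up to max_len symbols."""
    words = ("".join(t) for n in range(max_len + 1) for t in product(alphabet, repeat=n))
    return {w for w in words if accepts(g, w)}


G = {
    "S": ("alt", ("nt", "A"), ("nt", "B")),
    "A": ("alt", "ab", ("nt", "C")),
    "B": ("alt", "a", ("seq", ("nt", "C"), "d")),
    "C": "c",
}
G_swapped = dict(G, S=("alt", ("nt", "B"), ("nt", "A")))  # S -> B | A
```

The enumeration is bounded (strings of up to three symbols), which is enough here because the grammar is not recursive.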

Nevertheless, any strong-LL($k$) language can be parsed by a top-down parser
without backtracking while using $k$ symbols of lookahead. So it seems intuitive
that we can use syntactic predicates to direct a PEG parser to the right
alternative. We cannot interpret $G$ as a PEG and recognize the same language, but
we can add predicates to $G$, to emulate the predictions that a strong-LL($k$)
parser makes.

An approach for translating a PE-CFG $G$ to a PEG that recognizes the same
language is to add an and-predicate (syntactic sugar for a double application
of the not-predicate) in front of every alternative of a non-terminal; this
and-predicate tests the next $k$ symbols of the input against the possible
lookahead values that a strong-LL($k$) parser would use for that alternative. For
the strong-LL(2) grammar above, the translation results in the following PEG:

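$$\begin{aligned}
S &\rightarrow \&(ab \mid c\$)\,A \mid \&(a\$ \mid cd)\,B \\
A &\rightarrow \&(ab)\,ab \mid \&(c\$)\,C \\
B &\rightarrow \&(a\$)\,a \mid \&(cd)\,Cd \\
C &\rightarrow \&(cd \mid c\$)\,c
\end{aligned}$$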

It is easy to check that this PEG recognizes the language $\{a, ab, c, cd\}$, the
same as $G$, if we include the marker \$ at the end of the input strings for the
PEG. The formal definition of the translation does not add the predicate to the
last alternative of a non-terminal (or to the sole alternative, in case of
non-terminal $C$ above).
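The claim about the guarded PEG can be checked mechanically. The following is a minimal sketch of a PEG matcher with an and-predicate, run over the PEG shown above; the tuple encoding and the helper names (`seq`, `alt`, `pred`, `nt`, `peg_end`) are hypothetical, and an input `w` is accepted when matching `w$` leaves exactly the marker.

```python
from itertools import product


def seq(*ps): return ("seq",) + ps
def alt(*ps): return ("alt",) + ps
def pred(p): return ("and", p)      # and-predicate &p
def nt(name): return ("nt", name)


def peg_end(g, p, s, i):
    """Position where p stops when matching s from i, or None on failure."""
    if isinstance(p, str):          # a string of terminals
        return i + len(p) if s.startswith(p, i) else None
    kind, args = p[0], p[1:]
    if kind == "nt":
        return peg_end(g, g[args[0]], s, i)
    if kind == "and":               # succeeds without consuming input
        return i if peg_end(g, args[0], s, i) is not None else None
    if kind == "seq":
        for q in args:
            i = peg_end(g, q, s, i)
            if i is None:
                return None
        return i
    for q in args:                  # "alt": ordered choice
        j = peg_end(g, q, s, i)
        if j is not None:
            return j
    return None


P = {
    "S": alt(seq(pred(alt("ab", "c$")), nt("A")), seq(pred(alt("a$", "cd")), nt("B"))),
    "A": alt(seq(pred("ab"), "ab"), seq(pred("c$"), nt("C"))),
    "B": alt(seq(pred("a$"), "a"), seq(pred("cd"), nt("C"), "d")),
    "C": seq(pred(alt("cd", "c$")), "c"),
}

words = ["".join(t) for n in range(4) for t in product("abcd", repeat=n)]
accepted = {w for w in words if peg_end(P, P["S"], w + "$", 0) == len(w)}
```

The check is bounded to inputs of up to three symbols over the terminals of the grammar.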

Before formalizing our translation and proving its correctness, we will give
definitions of strong-LL($k$) properties using our new CFG formalism. First we
need an auxiliary function $\mathit{take}_k$, with the definition below:

$$\mathit{take}_k(\varepsilon) = \varepsilon$$

$$\mathit{take}_k(a_1 \ldots a_n) = \begin{cases} a_1 \ldots a_k & \text{if } n > k \\ a_1 \ldots a_n & \text{otherwise} \end{cases}$$

We will say that $\mathit{take}_k(x)$ is the $k$-prefix of $x$. We also need to
define $\cdot_k$, a language concatenation operation that results in $k$-prefixes
(i.e. concatenates each string of the first language with each string of the
second language, taking the $k$-prefix of each result):

$$X \cdot_k Y = \{\mathit{take}_k(x) \mid x \in X \cdot Y\}$$

A property of $k$-prefixes is that the $k$-prefix of the concatenation of two
strings is also the $k$-prefix of the concatenation of their $k$-prefixes (proof by
case analysis on the definition of $\mathit{take}_k$):

$$\mathit{take}_k(xy) = \mathit{take}_k(\mathit{take}_k(x)\,\mathit{take}_k(y))$$

This leads directly to the following simple lemma, which we only include to
reference in later proofs:

Lemma 4.1. If $\mathit{take}_k(x) \in X$ and $\mathit{take}_k(y) \in Y$ then we have
$\mathit{take}_k(xy) \in X \cdot_k Y$.

Proof. Trivial. □
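The $k$-prefix function, the $\cdot_k$ operation, and the prefix identity translate directly into code. This is a small sketch with strings as Python `str` values; the names `take_k`, `concat_k`, and `identity_holds` are hypothetical, and the identity is only checked exhaustively on short strings, not proved.

```python
from itertools import product


def take_k(x, k):
    """The k-prefix of x: the first k symbols, or x itself when it is shorter."""
    return x[:k]


def concat_k(X, Y, k):
    """X .k Y: the k-prefixes of every concatenation of a string of X with one of Y."""
    return {take_k(x + y, k) for x in X for y in Y}


def identity_holds(k, alphabet="ab", max_len=3):
    """Check take_k(xy) == take_k(take_k(x) take_k(y)) on all short strings."""
    words = ["".join(t) for n in range(max_len + 1) for t in product(alphabet, repeat=n)]
    return all(
        take_k(x + y, k) == take_k(take_k(x, k) + take_k(y, k), k)
        for x in words
        for y in words
    )
```

For example, `concat_k({"ab", "c"}, {"$$"}, 2)` yields the two lookahead strings `ab` and `c$`.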

We can now define the $\mathit{FIRST}_k$ sets, the strong-LL($k$) analog of the LL(1)
FIRST sets. The $\mathit{FIRST}_k$ set of an expression $p$ is the set of the
$k$-prefixes of every string that $p$ matches:

$$\mathit{FIRST}_k^G(p) = \{\mathit{take}_k(x) \mid G[p]\ xy \stackrel{\mathrm{CFG}}{\leadsto} y\}$$

The definition of $\mathit{FOLLOW}_k$ sets is also a straightforward extension of the
definition of FOLLOW sets for LL(1) grammars:

$$\mathit{FOLLOW}_k^G(A) = \{\mathit{take}_k(y) \mid G[A]\ xy \stackrel{\mathrm{CFG}}{\leadsto} y \text{ is in a proof tree for } G\ w\$^k \stackrel{\mathrm{CFG}}{\leadsto} \$^k\}$$

To ensure that all members of $\mathit{FOLLOW}_k$ have length $k$, we use $k$
end-of-input markers $\$ \notin T$ instead of the single marker we used with LL(1)
grammars. The semantics of $\stackrel{\mathrm{CFG}}{\leadsto}$ guarantee that $\$^k$
is a suffix of $y$, so the length of $y$ is at least $k$ and the length of
$\mathit{take}_k(y)$ is always $k$.

We can now state the strong-LL($k$) property: a PE-CFG $G$ with BNF structure
is strong-LL($k$) if and only if every choice expression $p_1 \mid p_2$ of every
production $A \rightarrow p$ satisfies the following condition:

$$(\mathit{FIRST}_k^G(p_1) \cdot_k \mathit{FOLLOW}_k^G(A)) \cap (\mathit{FIRST}_k^G(p_2) \cdot_k \mathit{FOLLOW}_k^G(A)) = \emptyset$$

The strong-LL($k$) property is just a formal way of saying that the next $k$
symbols of the input are enough to choose among the choice expressions of a
given non-terminal.
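The property can be tested on the example grammar of this section. The sketch below does not follow the proof-tree definitions literally: it assumes that the textbook fixpoint computations of $\mathit{FIRST}_k$ and $\mathit{FOLLOW}_k$ give the same sets for this grammar. The grammar encoding (each alternative a string of symbols, uppercase letters being non-terminals) and all function names are hypothetical.

```python
def concat_k(X, Y, k):
    """X .k Y on sets of strings."""
    return {(x + y)[:k] for x in X for y in Y}


def first_of(F, symbols, k):
    """FIRST_k of a sequence of symbols, given the FIRST_k sets F of the non-terminals."""
    out = {""}
    for sym in symbols:
        out = concat_k(out, F[sym] if sym.isupper() else {sym}, k)
    return out


def first_k(g, k):
    """Least fixpoint of the FIRST_k equations."""
    F = {A: set() for A in g}
    changed = True
    while changed:
        changed = False
        for A, alts in g.items():
            for a in alts:
                new = first_of(F, a, k) - F[A]
                if new:
                    F[A] |= new
                    changed = True
    return F


def follow_k(g, F, k, start="S"):
    """Least fixpoint of the FOLLOW_k equations, with k end markers after the start symbol."""
    W = {A: set() for A in g}
    W[start] = {"$" * k}
    changed = True
    while changed:
        changed = False
        for A, alts in g.items():
            for a in alts:
                for i, sym in enumerate(a):
                    if sym.isupper():
                        new = concat_k(first_of(F, a[i + 1:], k), W[A], k) - W[sym]
                        if new:
                            W[sym] |= new
                            changed = True
    return W


def lookaheads(g, k):
    """For each non-terminal, FIRST_k(p) .k FOLLOW_k(A) of each of its alternatives p."""
    F = first_k(g, k)
    W = follow_k(g, F, k)
    return {A: [concat_k(first_of(F, a, k), W[A], k) for a in alts] for A, alts in g.items()}


def is_strong_ll(g, k):
    """True when the lookahead sets of the alternatives of every non-terminal are disjoint."""
    return all(
        not (s1 & s2)
        for sets in lookaheads(g, k).values()
        for i, s1 in enumerate(sets)
        for s2 in sets[i + 1:]
    )


G = {"S": ["A", "B"], "A": ["ab", "C"], "B": ["a", "Cd"], "C": ["c"]}
```

With two symbols of lookahead the alternatives of $S$ get the disjoint sets $\{ab, c\$\}$ and $\{a\$, cd\}$; with one symbol both alternatives get $\{a, c\}$, so the grammar is not strong-LL(1).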

We also need an auxiliary function $\mathit{choice}$ that takes a set of strings and
makes a choice expression with each string as an alternative of this choice:

$$\mathit{choice}(\emptyset) = \varepsilon$$

$$\mathit{choice}(\{p_1, \ldots, p_n\}) = p_1 \mid \cdots \mid p_n$$

We will use $\mathit{choice}$ to transform a lookahead set into a lookahead
expression. Our translation inserts lookahead expressions to direct the PEG
parser to the correct alternative in a choice, so we only need to change choice
operations. Because we are assuming that our PE-CFGs have BNF structure, these
choice operations are at the "top-level" of each production. Intuitively, if
$p_1 \mid p_2$ is a choice of non-terminal $A$, $\varphi_k^G(p_1 \mid p_2, A)$ adds the
lookahead expression $\mathcal{L}_k^G(p_1, A)$ to $p_1$ and recursively transforms
$p_2$; any expression that is not a choice is not transformed:

$$\begin{aligned}
\varphi_k^G(p_1 \mid p_2, A) &= \mathcal{L}_k^G(p_1, A)\,p_1 \mid \varphi_k^G(p_2, A) \\
\varphi_k^G(\varepsilon, A) &= \varepsilon \\
\varphi_k^G(a, A) &= a \\
\varphi_k^G(p_1 p_2, A) &= p_1 p_2 \\
\varphi_k^G(B, A) &= B
\end{aligned}$$

$$\text{where } \mathcal{L}_k^G(p, A) = \&\mathit{choice}(\mathit{FIRST}_k^G(p) \cdot_k \mathit{FOLLOW}_k^G(A))$$

The definition of our translation is now straightforward. From a strong-LL($k$)
grammar $G$ with BNF structure we can generate a PEG $\Phi_b(G)$ (the before
LL($k$)-PEG of $G$) by replacing each production $A \rightarrow p$ with
$A \rightarrow \varphi_k^G(p, A)$.
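The translation itself is a short recursive function once the lookahead sets are available. The sketch below takes them as a parameter (here a table copied by hand from the sets $\mathit{FIRST}_2(p) \cdot_2 \mathit{FOLLOW}_2(A)$ of the example grammar) rather than computing them; the tuple encoding and the names `choice`, `before`, and `translate` are hypothetical.

```python
# Expressions: a str is a sequence of terminals, ("nt", A) a non-terminal,
# ("seq", p1, p2) a concatenation, ("alt", p1, p2) a choice, ("and", p) an and-predicate.


def choice(strings):
    """A right-nested choice with one alternative per string; epsilon for the empty set."""
    items = sorted(strings)
    if not items:
        return ""
    expr = items[-1]
    for s in reversed(items[:-1]):
        expr = ("alt", s, expr)
    return expr


def before(p, A, lookahead):
    """Guard the first half of each top-level choice; leave any other expression alone."""
    if isinstance(p, tuple) and p[0] == "alt":
        guard = ("and", choice(lookahead(p[1], A)))
        return ("alt", ("seq", guard, p[1]), before(p[2], A, lookahead))
    return p


def translate(g, lookahead):
    """Apply the translation to the right side of every production."""
    return {A: before(p, A, lookahead) for A, p in g.items()}


G = {
    "S": ("alt", ("nt", "A"), ("nt", "B")),
    "A": ("alt", "ab", ("nt", "C")),
    "B": ("alt", "a", ("seq", ("nt", "C"), "d")),
    "C": "c",
}

# Hand-copied lookahead sets for the guarded alternatives (an assumption of this sketch).
TABLE = {
    ("S", ("nt", "A")): {"ab", "c$"},
    ("A", "ab"): {"ab"},
    ("B", "a"): {"a$"},
}

translated = translate(G, lambda p, A: TABLE[(A, p)])
```

As in the formal definition, the last alternative of each non-terminal is left without a guard, so the result is slightly smaller than the fully guarded PEG displayed earlier.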

To prove the correctness of the translation, we will use the same approach
that we took in the proof for LL(1) grammars with $\varepsilon$ expressions. We will
prove that in any derivation of $G\ x\$^k \stackrel{\mathrm{CFG}}{\leadsto} \$^k$ all
of the subparts of the derivation have correspondents in $\Phi_b(G)$ via function
$\varphi_k^G$. One subtlety of the proof is the parameter $A$ of $\varphi_k^G$; our
definition of $\Phi_b(G)$ makes it clear that $A$ in $\varphi_k^G(p, A)$ is the
non-terminal that "owns" the expression $p$. For an expression $p$ that appears
in a subpart of the derivation of
$G\ x\$^k \stackrel{\mathrm{CFG}}{\leadsto} \$^k$ as $G[p]$, $A$ is the first
non-terminal that appears as $G[A]$ in a path from this subpart to the conclusion
$G\ x\$^k \stackrel{\mathrm{CFG}}{\leadsto} \$^k$. Formally, we can state the
following lemma, a version of the lemma we proved for LL(1) grammars with
$\varepsilon$ expressions:

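Lemma 4.2. Given a strong-LL($k$) PE-CFG $G$, if there is a proof tree for
$G\ x\$^k \stackrel{\mathrm{CFG}}{\leadsto} \$^k$ then, for every subtree
$G[p]\ x'\$^k \stackrel{\mathrm{CFG}}{\leadsto} x''\$^k$ of this proof tree, we have
$\Phi_b(G)[\varphi_k^G(p, A)]\ x'\$^k \stackrel{\mathrm{PEG}}{\leadsto} x''\$^k$, where
$A$ is the first non-terminal that appears as $G[A]$ in a path from the conclusion
$G[p]\ x'\$^k \stackrel{\mathrm{CFG}}{\leadsto} x''\$^k$ of the subtree to the
conclusion $G\ x\$^k \stackrel{\mathrm{CFG}}{\leadsto} \$^k$ of the whole tree.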

Proof. By induction on the height of the proof tree for
$G[p]\ x'\$^k \stackrel{\mathrm{CFG}}{\leadsto} x''\$^k$. The interesting cases are
choice.1 and choice.2. For case choice.1, we have
$G[p_1]\ x'\$^k \stackrel{\mathrm{CFG}}{\leadsto} x''\$^k$. Because of BNF structure,
this is a subtree of $G[A]\ x'\$^k \stackrel{\mathrm{CFG}}{\leadsto} x''\$^k$, and we
have $\mathit{take}_k(x''\$^k) \in \mathit{FOLLOW}_k^G(A)$ by the definition of
$\mathit{FOLLOW}_k$. If we combine this with Lemma 4.1 we have
$\mathit{take}_k(x'\$^k) \in \mathit{FIRST}_k^G(p_1) \cdot_k \mathit{FOLLOW}_k^G(A)$,
because the $k$-prefix of what $p_1$ matches is in $\mathit{FIRST}_k^G(p_1)$. So
$\Phi_b(G)[\mathcal{L}_k^G(p_1, A)]\ x'\$^k \stackrel{\mathrm{PEG}}{\leadsto} x'\$^k$
by the definition of $\mathcal{L}_k^G$. By the induction hypothesis,
$\Phi_b(G)[p_1]\ x'\$^k \stackrel{\mathrm{PEG}}{\leadsto} x''\$^k$, and with
applications of rules con.1 and ord.1 we have
$\Phi_b(G)[\varphi_k^G(p_1 \mid p_2, A)]\ x'\$^k \stackrel{\mathrm{PEG}}{\leadsto} x''\$^k$.

For case choice.2, we can use the LL($k$) property and an argument similar
to the one used in choice.1 to conclude that
$\mathit{take}_k(x'\$^k) \notin \mathit{FIRST}_k^G(p_1) \cdot_k \mathit{FOLLOW}_k^G(A)$.
So, by the definition of $\mathcal{L}_k^G$,
$\Phi_b(G)[\mathcal{L}_k^G(p_1, A)]\ x'\$^k \stackrel{\mathrm{PEG}}{\leadsto} \mathtt{fail}$.
By the induction hypothesis, we have
$\Phi_b(G)[\varphi_k^G(p_2, A)]\ x'\$^k \stackrel{\mathrm{PEG}}{\leadsto} x''\$^k$, and
by rules con.2 and ord.2 we have
$\Phi_b(G)[\varphi_k^G(p_1 \mid p_2, A)]\ x'\$^k \stackrel{\mathrm{PEG}}{\leadsto} x''\$^k$. □

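We also need to prove that for any strong-LL($k$) PE-CFG $G$ we have
$G\ x\$^k \stackrel{\mathrm{CFG}}{\leadsto} \$^k$ if
$\Phi_b(G)\ x\$^k \stackrel{\mathrm{PEG}}{\leadsto} \$^k$. Intuitively, if
$(\&p_1)p_2$ matches a string $x$ then $p_2$ also matches $x$, so if we have a proof
tree for $\Phi_b(G)\ x\$^k \stackrel{\mathrm{PEG}}{\leadsto} \$^k$ we will be able to
erase all the predicates introduced by $\Phi_b$ and build a proof tree for
$G\ x\$^k \stackrel{\mathrm{CFG}}{\leadsto} \$^k$.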

First, let us define predicate erasure as follows: the erasure of $(!p_1)p_2$ is
the erasure of $p_2$. Any predicate occurring alone is replaced by $\varepsilon$. All
other expressions just recursively erase predicates on their subparts. We get the
erasure of a grammar by erasing the predicates in the right sides of every
production, plus the initial symbol. The purpose of having a special case for the
erasure of $(!p_1)p_2$ is to have the erasure of $\Phi_b(G)$ be $G$. Now we can
prove the following lemma:



Lemma 4.3. Given a PEG $G$ and an expression $p$, and the PE-CFG $G'$ and
expression $p'$ obtained by erasing all predicates of $G$ and $p$, if
$G[p]\ xy \stackrel{\mathrm{PEG}}{\leadsto} y$ then
$G'[p']\ xy \stackrel{\mathrm{CFG}}{\leadsto} y$.

Proof. By induction on the height of the proof tree for
$G[p]\ xy \stackrel{\mathrm{PEG}}{\leadsto} y$. □
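Predicate erasure, as defined before Lemma 4.3, is a structural recursion on expressions. The following is a minimal sketch on a hypothetical tuple encoding in which the and-predicate is kept as its own node and treated like the not-predicate it abbreviates; the names `is_predicate`, `erase`, and `erase_grammar` are not from the paper.

```python
# Expressions: a str is a sequence of terminals ("" is epsilon), ("nt", A) a non-terminal,
# ("seq", p1, p2) a concatenation, ("alt", p1, p2) a choice,
# ("not", p) a not-predicate, ("and", p) an and-predicate.


def is_predicate(p):
    return isinstance(p, tuple) and p[0] in ("not", "and")


def erase(p):
    if isinstance(p, str) or p[0] == "nt":
        return p
    if is_predicate(p):
        return ""                            # a predicate occurring alone becomes epsilon
    if p[0] == "seq" and is_predicate(p[1]):
        return erase(p[2])                   # the erasure of (!p1)p2 is the erasure of p2
    return (p[0], erase(p[1]), erase(p[2]))  # other "seq" and "alt" nodes recurse


def erase_grammar(g):
    return {A: erase(p) for A, p in g.items()}


G = {
    "S": ("alt", ("nt", "A"), ("nt", "B")),
    "A": ("alt", "ab", ("nt", "C")),
    "B": ("alt", "a", ("seq", ("nt", "C"), "d")),
    "C": "c",
}

# The guarded PEG of the strong-LL(2) example, with a guard on every alternative.
GUARDED = {
    "S": ("alt", ("seq", ("and", ("alt", "ab", "c$")), ("nt", "A")),
                 ("seq", ("and", ("alt", "a$", "cd")), ("nt", "B"))),
    "A": ("alt", ("seq", ("and", "ab"), "ab"), ("seq", ("and", "c$"), ("nt", "C"))),
    "B": ("alt", ("seq", ("and", "a$"), "a"),
                 ("seq", ("and", "cd"), ("seq", ("nt", "C"), "d"))),
    "C": ("seq", ("and", ("alt", "cd", "c$")), "c"),
}
```

Erasing the guards of the example PEG gives back the original grammar, which is the purpose of the special case for a predicate at the front of a concatenation.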

The proof that $\Phi_b(G)$ has the same language as $G$ is now a corollary of
Lemmas 4.2 and 4.3.

There is another approach for translating a strong-LL($k$) PE-CFG $G$ to an
equivalent PEG. This approach uses a subtle consequence of the strong-LL($k$)
property: take the alternatives $p_1$ to $p_n$ of a non-terminal $A$. Now let us say
that two alternatives $p_i$ and $p_j$ both match prefixes of an input $w$, say $x_i$
and $x_j$, with $x_i y_i = x_j y_j = w$; that is,
$G[p_i]\ x_i y_i \stackrel{\mathrm{CFG}}{\leadsto} y_i$ and
$G[p_j]\ x_j y_j \stackrel{\mathrm{CFG}}{\leadsto} y_j$. By the definition of
$\mathit{FIRST}_k$, we have $\mathit{take}_k(x_i) \in \mathit{FIRST}_k^G(p_i)$ and
$\mathit{take}_k(x_j) \in \mathit{FIRST}_k^G(p_j)$. Therefore we cannot have both
$\mathit{take}_k(y_i) \in \mathit{FOLLOW}_k^G(A)$ and
$\mathit{take}_k(y_j) \in \mathit{FOLLOW}_k^G(A)$, or we would violate the
strong-LL($k$) property by having $\mathit{take}_k(w)$ in both
$\mathit{FIRST}_k^G(p_i) \cdot_k \mathit{FOLLOW}_k^G(A)$ and
$\mathit{FIRST}_k^G(p_j) \cdot_k \mathit{FOLLOW}_k^G(A)$ (Lemma 4.1).

The fact that we cannot have the first $k$ symbols of both $y_i$ and $y_j$ in
$\mathit{FOLLOW}_k^G(A)$ is the core of this other approach, which is to add a guard
after each alternative of a non-terminal $A$ to test if the next $k$ symbols of the
input are in $\mathit{FOLLOW}_k^G(A)$. PEG's local backtracking then guarantees that
the wrong alternative will not be taken even if it matches a prefix of the input.

For the strong-LL(2) grammar we used as an example in the beginning of 
this section, this approach yields the following translated PEG: 

$$\begin{aligned}
S &\rightarrow A\,\&(\$\$) \mid B\,\&(\$\$) \\
A &\rightarrow ab\,\&(\$\$) \mid C\,\&(\$\$) \\
B &\rightarrow a\,\&(\$\$) \mid Cd\,\&(\$\$) \\
C &\rightarrow c\,\&(\$\$ \mid d\$)
\end{aligned}$$

It is easy to check that this PEG recognizes the correct language $\{a, ab, c, cd\}$
if we include the marker \$\$ at the end of the input string.
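The same kind of mechanical check used for the first translation applies here. This is a minimal sketch of a PEG matcher run over the PEG with trailing guards shown above; the tuple encoding and helper names (`seq`, `alt`, `pred`, `nt`, `peg_end`) are hypothetical, and an input `w` is accepted when matching `w$$` leaves exactly the two markers.

```python
from itertools import product


def seq(*ps): return ("seq",) + ps
def alt(*ps): return ("alt",) + ps
def pred(p): return ("and", p)      # and-predicate &p
def nt(name): return ("nt", name)


def peg_end(g, p, s, i):
    """Position where p stops when matching s from i, or None on failure."""
    if isinstance(p, str):          # a string of terminals
        return i + len(p) if s.startswith(p, i) else None
    kind, args = p[0], p[1:]
    if kind == "nt":
        return peg_end(g, g[args[0]], s, i)
    if kind == "and":               # succeeds without consuming input
        return i if peg_end(g, args[0], s, i) is not None else None
    if kind == "seq":
        for q in args:
            i = peg_end(g, q, s, i)
            if i is None:
                return None
        return i
    for q in args:                  # "alt": ordered choice with local backtracking
        j = peg_end(g, q, s, i)
        if j is not None:
            return j
    return None


Q = {
    "S": alt(seq(nt("A"), pred("$$")), seq(nt("B"), pred("$$"))),
    "A": alt(seq("ab", pred("$$")), seq(nt("C"), pred("$$"))),
    "B": alt(seq("a", pred("$$")), seq(nt("C"), "d", pred("$$"))),
    "C": seq("c", pred(alt("$$", "d$"))),
}

words = ["".join(t) for n in range(4) for t in product("abcd", repeat=n)]
accepted = {w for w in words if peg_end(Q, Q["S"], w + "$$", 0) == len(w)}
```

On input `cd$$` the alternative through $A$ matches `c` but its guard sees `d$` instead of `$$`, so the matcher backtracks and takes $B$, which is the behaviour the text attributes to local backtracking.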

Like with our first translation, our second translation uses a function
$\phi_k^G$ that translates the choice expressions in the production of a
non-terminal $A$, adding an and-predicate built from a choice of every string in
$\mathit{FOLLOW}_k^G(A)$ to the first half of the choice and recursively translating
the second half. As with $\varphi_k^G$, $\phi_k^G$ is the identity function for other
kinds of expressions, as the translation only changes choice expressions and we
assume BNF structure:

$$\begin{aligned}
\phi_k^G(p_1 \mid p_2, A) &= p_1\,\&\mathit{choice}(\mathit{FOLLOW}_k^G(A)) \mid \phi_k^G(p_2, A) \\
\phi_k^G(\varepsilon, A) &= \varepsilon \\
\phi_k^G(a, A) &= a \\
\phi_k^G(p_1 p_2, A) &= p_1 p_2 \\
\phi_k^G(B, A) &= B
\end{aligned}$$

The definition of the second translation is now straightforward. From a
strong-LL($k$) grammar $G$ with BNF structure we can generate a PEG $\Phi_a(G)$
(the after LL($k$)-PEG of $G$) by replacing each production $A \rightarrow p$ with
$A \rightarrow \phi_k^G(p, A)$.

The following lemma is like Lemma 4.2 in that it proves that all subparts of
a derivation for $G\ x\$^k \stackrel{\mathrm{CFG}}{\leadsto} \$^k$ have
correspondents in $\Phi_a(G)$ via function $\phi_k^G$. As with Lemmas 4.2 and 3.2,
we need to restrict ourselves to matches that consume all input but the
end-of-input marker $\$^k$ so we can use the $\mathit{FOLLOW}_k$ sets of the
non-terminals in our proof, and by extension the LL($k$) properties of $G$.

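Lemma 4.4. Given a strong-LL($k$) PE-CFG $G$, if there is a proof tree for
$G\ x\$^k \stackrel{\mathrm{CFG}}{\leadsto} \$^k$ then, for every subtree
$G[p]\ x'\$^k \stackrel{\mathrm{CFG}}{\leadsto} x''\$^k$ of this proof tree, we have
$\Phi_a(G)[\phi_k^G(p, A)]\ x'\$^k \stackrel{\mathrm{PEG}}{\leadsto} x''\$^k$, where
$A$ is the first non-terminal that appears as $G[A]$ in a path from the conclusion
$G[p]\ x'\$^k \stackrel{\mathrm{CFG}}{\leadsto} x''\$^k$ of the subtree to the
conclusion $G\ x\$^k \stackrel{\mathrm{CFG}}{\leadsto} \$^k$ of the whole tree.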

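Proof. By induction on the height of the proof tree for
$G[p]\ x'\$^k \stackrel{\mathrm{CFG}}{\leadsto} x''\$^k$. The interesting cases are
choice.1 and choice.2. For case choice.1, we have
$G[p_1]\ x'\$^k \stackrel{\mathrm{CFG}}{\leadsto} x''\$^k$. Because of BNF structure,
this is a subtree of $G[A]\ x'\$^k \stackrel{\mathrm{CFG}}{\leadsto} x''\$^k$, and
$\mathit{take}_k(x''\$^k) \in \mathit{FOLLOW}_k^G(A)$ by the definition of
$\mathit{FOLLOW}_k$. It is then easy to see that
$\Phi_a(G)[\&\mathit{choice}(\mathit{FOLLOW}_k^G(A))]\ x''\$^k \stackrel{\mathrm{PEG}}{\leadsto} x''\$^k$.
By the induction hypothesis, we have
$\Phi_a(G)[p_1]\ x'\$^k \stackrel{\mathrm{PEG}}{\leadsto} x''\$^k$, and with
applications of con.1 and ord.1 we have
$\Phi_a(G)[\phi_k^G(p_1 \mid p_2, A)]\ x'\$^k \stackrel{\mathrm{PEG}}{\leadsto} x''\$^k$.

For case choice.2, if there is no $w$ so
$\Phi_a(G)[p_1]\ x'\$^k \stackrel{\mathrm{PEG}}{\leadsto} w$ then we can conclude
$\Phi_a(G)[p_1]\ x'\$^k \stackrel{\mathrm{PEG}}{\leadsto} \mathtt{fail}$ by
completeness of LL($k$) grammars, and we can apply the induction hypothesis on
$p_2$ and rules con.2 and ord.2 to get
$\Phi_a(G)[\phi_k^G(p_1 \mid p_2, A)]\ x'\$^k \stackrel{\mathrm{PEG}}{\leadsto} x''\$^k$.
Now suppose we have $\Phi_a(G)[p_1]\ x'\$^k \stackrel{\mathrm{PEG}}{\leadsto} w$.
Expression $p_1$ is predicate-free, so we have
$G[p_1]\ x'\$^k \stackrel{\mathrm{CFG}}{\leadsto} w$ by Lemma 4.3. We have
$\mathit{take}_k(x''\$^k) \in \mathit{FOLLOW}_k^G(A)$ by the definition of
$\mathit{FOLLOW}_k$, and we have already seen that this means
$\mathit{take}_k(w) \notin \mathit{FOLLOW}_k^G(A)$ or the LL($k$) property is
violated. So we have
$\Phi_a(G)[\&\mathit{choice}(\mathit{FOLLOW}_k^G(A))]\ w \stackrel{\mathrm{PEG}}{\leadsto} \mathtt{fail}$,
and we can apply the induction hypothesis on $p_2$ and rules con.1 and ord.2 to
conclude
$\Phi_a(G)[\phi_k^G(p_1 \mid p_2, A)]\ x'\$^k \stackrel{\mathrm{PEG}}{\leadsto} x''\$^k$. □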

The proof that $\Phi_a(G)$ has the same language as $G$ is a corollary of
Lemmas 4.4 and 4.3.

LL(1) grammars are a special case of LL($k$) grammars where $k = 1$, and every
LL(1) grammar is also a strong-LL(1) grammar [7], so the two transformations
we presented can also be used for LL(1) grammars, although the lookahead
expressions become redundant.

5. Right-linear and LL-regular Grammars

A right-linear CFG is one where the right side of every production has at
most one non-terminal, and this non-terminal can only appear as the last symbol
of the production. Right-linear CFGs can only define regular languages, and
any regular language $R$ has a right-linear CFG $G$; it is straightforward to
encode any NFA as a right-linear CFG, and vice versa [8].

For PE-CFGs, we will define a right-linear PE-CFG as a CFG where the right
side of every production is a right-linear parsing expression. We define
right-linear parsing expressions as the following predicate on parsing
expressions: $\varepsilon$, $a$ and $A$ are right-linear; $p_1 p_2$ is right-linear if
and only if $p_1$ is a terminal and $p_2$ is right-linear; $p_1 \mid p_2$ is
right-linear if and only if $p_1$ and $p_2$ are right-linear. It is easy to see that
any right-linear CFG has a corresponding right-linear PE-CFG, and vice versa, as
the transformations between CFGs and PE-CFGs we have in Section 2 preserve
right-linearity.

Right-linear PE-CFGs, in the general case, do not recognize the same language
when interpreted by $\stackrel{\mathrm{CFG}}{\leadsto}$ and
$\stackrel{\mathrm{PEG}}{\leadsto}$. An obvious example is the grammar
$S \rightarrow a \mid aa$, which is right-linear and has the language $\{a, aa\}$
under $\stackrel{\mathrm{CFG}}{\leadsto}$ but $\{a\}$ under
$\stackrel{\mathrm{PEG}}{\leadsto}$. But the simplicity of right-linear grammars
lets us prove an equivalence between right-linear CFGs and PEGs by adding a single
restriction: the grammar's language must have the prefix property, that is, there
are no distinct strings $x$ and $y$ in the language such that $x$ is a prefix of
$y$. This leads to the following lemma:



Lemma 5.1. Given a right-linear PE-CFG $G$ and a right-linear parsing expression
$p$, if $L(G[p])$ has the prefix property and
$G[p]\ xy \stackrel{\mathrm{CFG}}{\leadsto} y$ then
$G[p]\ xy \stackrel{\mathrm{PEG}}{\leadsto} y$.



Proof. By induction on the height of the proof tree for
$G[p]\ xy \stackrel{\mathrm{CFG}}{\leadsto} y$. The interesting cases are con.1 and
choice.2. For case con.1, we have $p = p_1 p_2$, but as $p$ is right-linear this
means that $p_1 = a$ and $p_2$ is also right-linear. As $p = a p_2$, we have
$L(G[p]) = a \cdot L(G[p_2])$, so $L(G[p_2])$ also has the prefix property, and we
can use the induction hypothesis and rule con.1 of
$\stackrel{\mathrm{PEG}}{\leadsto}$ to get
$G[p]\ xy \stackrel{\mathrm{PEG}}{\leadsto} y$.

For case choice.2, we have $p = p_1 \mid p_2$, so
$L(G[p]) = L(G[p_1]) \cup L(G[p_2])$, and both $L(G[p_1])$ and $L(G[p_2])$ have the
prefix property. This means that either
$G[p_1]\ xy \stackrel{\mathrm{CFG}}{\leadsto} y$, so we can use the induction
hypothesis (as $p_1$ is also right-linear) and rule ord.1 to get
$G[p]\ xy \stackrel{\mathrm{PEG}}{\leadsto} y$, or there is no suffix $z$ of $xy$
with $G[p_1]\ xy \stackrel{\mathrm{CFG}}{\leadsto} z$, as that would violate the
prefix property of $L(G[p_1])$. In this case, we have
$G[p_1]\ xy \stackrel{\mathrm{PEG}}{\leadsto} \mathtt{fail}$ by modus tollens of
Lemma 2.3 and we can now use the induction hypothesis with $p_2$, which is also
right-linear, and rule ord.2, to get
$G[p]\ xy \stackrel{\mathrm{PEG}}{\leadsto} y$. □
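The role of the prefix property in Lemma 5.1 can be illustrated on the grammar $S \rightarrow a \mid aa$. The sketch below handles only a single production whose alternatives are strings of terminals, which is all this example needs; the names `has_prefix_property` and `peg_language` are hypothetical.

```python
def has_prefix_property(language):
    """True when no string of the language is a proper prefix of another one."""
    return not any(x != y and y.startswith(x) for x in language for y in language)


def peg_language(alternatives, candidates):
    """Candidates fully consumed by the ordered choice of the given alternatives."""
    def end(s):
        for p in alternatives:          # the first matching alternative wins
            if s.startswith(p):
                return len(p)
        return None
    return {w for w in candidates if end(w) == len(w)}


S = ["a", "aa"]                         # S -> a | aa; its CFG language is {a, aa}
S_marked = ["a$", "aa$"]                # the same alternatives followed by a marker

plain = peg_language(S, ["", "a", "aa", "aaa"])
marked = peg_language(S_marked, ["$", "a$", "aa$", "aaa$"])
```

Without the marker the language fails the prefix property and the ordered choice never reaches the second alternative; with the marker the language has the prefix property and both alternatives are recognized.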



The prefix-property restriction may seem overly restrictive, but we can obtain
a right-linear grammar with the prefix property from any right-linear grammar $G$
by applying the following transformation, where we use $\$ \notin T$ as an
end-of-input marker, to the right side of $G$'s productions and to its initial
parsing expression:

$$\begin{aligned}
\Pi(\varepsilon) &= \$ \\
\Pi(a) &= a\$ \\
\Pi(A) &= A \\
\Pi(p_1 p_2) &= p_1\,\Pi(p_2) \\
\Pi(p_1 \mid p_2) &= \Pi(p_1) \mid \Pi(p_2)
\end{aligned}$$
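The transformation $\Pi$ is a direct recursion on right-linear expressions. The following is a minimal sketch on a hypothetical tuple encoding, applied to the right-linear grammar $S \rightarrow a \mid aS$ (an example chosen for this sketch, not taken from the text), together with a small ordered-choice matcher to observe the effect.

```python
# Expressions: a str is a sequence of terminals ("" is epsilon), ("nt", A) a non-terminal,
# ("seq", p1, p2) a concatenation, ("alt", p1, p2) a choice.


def pi(p):
    """Append the end-of-input marker to every place where a match can end."""
    if isinstance(p, str):
        return p + "$"                      # Pi(epsilon) = $ and Pi(a) = a$
    if p[0] == "nt":
        return p                            # Pi(A) = A
    if p[0] == "seq":
        return ("seq", p[1], pi(p[2]))      # Pi(p1 p2) = p1 Pi(p2)
    return ("alt", pi(p[1]), pi(p[2]))      # Pi(p1 | p2) = Pi(p1) | Pi(p2)


def peg_end(g, p, s, i):
    """Position where p stops under ordered choice, or None on failure."""
    if isinstance(p, str):
        return i + len(p) if s.startswith(p, i) else None
    if p[0] == "nt":
        return peg_end(g, g[p[1]], s, i)
    if p[0] == "seq":
        j = peg_end(g, p[1], s, i)
        return None if j is None else peg_end(g, p[2], s, j)
    j = peg_end(g, p[1], s, i)
    return j if j is not None else peg_end(g, p[2], s, i)


G = {"S": ("alt", "a", ("seq", "a", ("nt", "S")))}   # S -> a | aS, CFG language a+
PG = {A: pi(p) for A, p in G.items()}                 # S -> a$ | aS

fully_matched_plain = [n for n in range(1, 6) if peg_end(G, G["S"], "a" * n, 0) == n]
fully_matched_marked = [n for n in range(1, 6)
                        if peg_end(PG, PG["S"], "a" * n + "$", 0) == n + 1]
```

Read as a PEG, the original grammar only consumes a whole input of length one, while the transformed grammar consumes every input of the form $a^n\$$ that was tried.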

A (strong) LL-regular CFG [9, 10] is a generalization of LL($k$) CFGs where
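a predictive top-down parser may decide which alternative to take based on
where the rest of the input falls on a set of regular partitions of $T^*$. Formally,
a CFG $G$ is LL-regular if there is a regular partition $\pi$ of $T^*$ such that for
any two leftmost derivations of the following form, if $x \equiv y \pmod{\pi}$ then
$\gamma = \delta$:

$$\begin{aligned}
S &\Rightarrow_G^* w_1 A \alpha_1 \Rightarrow_G w_1 \gamma \alpha_1 \Rightarrow_G^* w_1 x \\
S &\Rightarrow_G^* w_2 A \alpha_2 \Rightarrow_G w_2 \delta \alpha_2 \Rightarrow_G^* w_2 y
\end{aligned}$$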

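Restating the definition for PE-CFGs is straightforward. First we introduce
the $\mathit{BLOCK}_\pi^G$ set that tells in which blocks of partition $\pi$ the
input for $p$ falls:

$$\mathit{BLOCK}_\pi^G(p, A) = \{B_k \in \pi \mid G[p]\ xy \stackrel{\mathrm{CFG}}{\leadsto} y \text{ and } G[A]\ xy \stackrel{\mathrm{CFG}}{\leadsto} y \text{ are in a proof tree for } G\ w\$ \stackrel{\mathrm{CFG}}{\leadsto} \$ \text{ and } xy \in B_k\}$$

A PE-CFG $G$ with BNF structure is LL-regular if and only if there is a
partition $\pi$ of $T^* \cdot \{\$\}$ such that every choice expression
$p_1 \mid p_2$ of every production $A \rightarrow p$ has
$\mathit{BLOCK}_\pi^G(p_1, A) \cap \mathit{BLOCK}_\pi^G(p_2, A) = \emptyset$.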

The original definition of LL-regular grammars uses the partition where the 
rest of the input falls to predict alternatives, so our addition of an end-of-input 
marker $ is not changing the class of grammars we are defining, while ensuring 
that the blocks of the regular partition have the prefix property.

Any strong-LL($k$) grammar is also LL-regular [11], so a simple reordering
of alternatives is not sufficient for obtaining a PEG that recognizes the same
language as an LL-regular grammar $G$. But we can use the same approach
we used in the translation $\Phi_b$ to translate an LL-regular grammar $G$ into a
PEG $\mathcal{R}(G)$ that recognizes the same language, using an and-predicate to
add a lookahead expression to the front of the alternatives of each choice.

We assume that each block $B_k$ of the regular partition $\pi$ has a corresponding
right-linear grammar $G_{B_k}$, where the intersection of the non-terminal sets of
$G$ and of all these grammars is empty; the non-terminal set of $\mathcal{R}(G)$ is
the union of these sets.

We form the regular lookahead of an alternative $p$, $\mathcal{L}_\pi^G(p, A)$, by
making a choice of the grammars for each block in $\mathit{BLOCK}_\pi^G(p, A)$, and
wrapping this choice in an and-predicate:

$$\mathcal{L}_\pi^G(p, A) = \&\mathit{choice}(\{S_{B_k} \mid B_k \in \mathit{BLOCK}_\pi^G(p, A)\})$$

where $\mathit{choice}$ is the function we used to build the lookahead expressions
for $\varphi_k^G$ and $\phi_k^G$.

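Function $\rho_\pi^G(p, A)$ adds lookahead expressions where necessary, assuming
that the original grammar has BNF structure:

$$\begin{aligned}
\rho_\pi^G(p_1 \mid p_2, A) &= \mathcal{L}_\pi^G(p_1, A)\,p_1 \mid \rho_\pi^G(p_2, A) \\
\rho_\pi^G(\varepsilon, A) &= \varepsilon \\
\rho_\pi^G(a, A) &= a \\
\rho_\pi^G(p_1 p_2, A) &= p_1 p_2 \\
\rho_\pi^G(B, A) &= B
\end{aligned}$$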

We obtain the productions of $\mathcal{R}(G)$ by applying $\rho_\pi^G$ to the right
side of each production of $G$, and then adding the productions for each
$G_{B_k}$. We can prove a lemma similar to the one we proved for $\Phi_b$:



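Lemma 5.2. Given an LL-regular PE-CFG $G$, if there is a proof tree for
$G\ x\$ \stackrel{\mathrm{CFG}}{\leadsto} \$$ then, for every subtree
$G[p]\ x'\$ \stackrel{\mathrm{CFG}}{\leadsto} x''\$$, we have
$\mathcal{R}(G)[\rho_\pi^G(p, A)]\ x'\$ \stackrel{\mathrm{PEG}}{\leadsto} x''\$$, where
$A$ is the first non-terminal that appears as $G[A]$ in a path from the conclusion
$G[p]\ x'\$ \stackrel{\mathrm{CFG}}{\leadsto} x''\$$ of the subtree to the conclusion
$G\ x\$ \stackrel{\mathrm{CFG}}{\leadsto} \$$ of the whole tree.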

Proof. By induction on the height of the proof tree for
$G[p]\ x'\$ \stackrel{\mathrm{CFG}}{\leadsto} x''\$$. The interesting cases are
choice.1 and choice.2. For choice.1, we have
$G[p_1]\ x'\$ \stackrel{\mathrm{CFG}}{\leadsto} x''\$$. Because of the BNF structure,
this is a subtree of $G[A]\ x'\$ \stackrel{\mathrm{CFG}}{\leadsto} x''\$$, and we
have $x'\$$ in some block $B_k \in \mathit{BLOCK}_\pi^G(p_1, A)$. So
$x'\$ \in L(G_{B_k})$. It is easy to see that
$x'\$ \in L(\mathit{choice}(\{S_{B_k} \mid B_k \in \mathit{BLOCK}_\pi^G(p_1, A)\}))$;
this grammar is right-linear, so by Lemma 5.1 and the semantics of the
and-predicate we have
$\mathcal{R}(G)[\mathcal{L}_\pi^G(p_1, A)]\ x'\$ \stackrel{\mathrm{PEG}}{\leadsto} x'\$$.
We have $\mathcal{R}(G)[p_1]\ x'\$ \stackrel{\mathrm{PEG}}{\leadsto} x''\$$ by the
induction hypothesis, and can use rules con.1 and ord.1 to get
$\mathcal{R}(G)[\rho_\pi^G(p_1 \mid p_2, A)]\ x'\$ \stackrel{\mathrm{PEG}}{\leadsto} x''\$$.

For case choice.2, we can use the LL-regular property to conclude that
the block $B_k$ with $x'\$ \in B_k$ is not in $\mathit{BLOCK}_\pi^G(p_1, A)$, and
then use the definition of $\mathcal{L}_\pi^G$ and an argument similar to the one
used in choice.1 to get
$\mathcal{R}(G)[\mathcal{L}_\pi^G(p_1, A)]\ x'\$ \stackrel{\mathrm{PEG}}{\leadsto} \mathtt{fail}$.
We can use the induction hypothesis to get
$\mathcal{R}(G)[\rho_\pi^G(p_2, A)]\ x'\$ \stackrel{\mathrm{PEG}}{\leadsto} x''\$$, and
then use rules con.2 and ord.2 to get
$\mathcal{R}(G)[\rho_\pi^G(p_1 \mid p_2, A)]\ x'\$ \stackrel{\mathrm{PEG}}{\leadsto} x''\$$. □

The proof that $\mathcal{R}(G)$ has the same language as $G$ is a corollary of
Lemmas 5.2 and 4.3. We can use Lemma 4.3 even though $\mathcal{R}(G)$ has
non-terminals that are not present in $G$; these extra non-terminals are only
referenced inside predicates, so they become useless when the predicates are
removed, and can also be removed.

6. Related Work 

Parsing Expression Grammars have generated much academic interest since 
their introduction by Ford [1], with over forty citations of this paper in ACM's
Digital Library. But just a few of these works are concerned with the theory 
of PEGs and their relation to other parsing tools; this section discusses these 
works and how they relate to our work. 

Ford [1] leaves open the problem of how PEGs and CFGs relate, and does
not outline a strategy to solve this problem. The solution of this problem for 
the major classes of top-down CFGs is a contribution of our work, and our 
recasting of the CFGs in a recognition-based formalism is another contribution 
that shows where CFGs and PEGs diverge. 

Redziejowski [12] adapts the FIRST and FOLLOW relations of CFGs to
the study of Parsing Expression Grammars, defining PEG analogs of these two
relations. Redziejowski then uses the analogs for a conservative approximation
of when an ordered choice is commutative or not, by defining a PEG analog of
the LL(1) restriction. He admits that the approximation is too conservative for
practical use, especially because of its treatment of syntactic predicates. We do
not attempt to give definitions of FIRST and FOLLOW for PEGs, limiting our
redefinitions of these relations just to our PE-CFG formalism, and using them to
prove correspondences between LL(1) and strong-LL($k$) CFGs and structurally
similar PEGs, something that Redziejowski does not explore.

An earlier work by Redziejowski [13] presents several identities regarding the 
languages defined by PEGs, although the author concludes that the identities 
are only useful for obtaining approximations of a PEG's language, and he also
does not try to relate the languages of PEGs and of CFGs. 



Schmitz [14] presents both an ambiguity detection algorithm for CFGs in
the context of the SDF2 formalism, an extension of CFGs, and an algorithm for
detecting whether an ordered choice in a PEG is commutative or not. Schmitz
notes that an overly strict ambiguity detector can be as restrictive as introducing
ordering, but does not attempt to further study the relation between CFGs and
PEGs.

Parr and Quong [15] add semantic and syntactic predicates to LL($k$) grammars
to get the pred-LL($k$) parsing strategy. The syntactic predicates of pred-LL($k$)
are only used in productions that have LL($k$) conflicts. In these productions,
the parser tries to match the predicates of each alternative in the order
they are given in the grammar definition, choosing the first alternative with a
predicate that succeeds. Backtracking is strictly local, as with PEGs, so a
subsequent failure does not make the parser try other productions. The paper does
not give a formal specification of pred-LL($k$) grammars, nor how they relate to
the class of LL($k$) grammars.

Parr and Fisher [16] introduce the LL(*) parsing strategy, which uses the
basic idea of LL-regular grammars, with a predictive top-down parser for these
grammars that uses a deterministic finite automaton to select which alternative
of a non-terminal to take. If the grammar is not LL-regular, the LL(*) parser
can use semantic and syntactic predicates with local backtracking, as in
pred-LL($k$), and also automatically introduce predicates, via a suitably named
"PEG mode".

7. Conclusions 

We presented a new formalism for context-free grammars that is based on
recognizing (parts of) strings instead of generating them. We adopted a subset of
the syntax of parsing expression grammars, and the notion of letting a grammar
recognize just part of an input string, to purposefully get a definition for CFGs
that is closer to PEGs. These PE-CFGs define the same class of languages as
traditional CFGs, and simple transformations let us get a PE-CFG from a CFG and
vice versa.

Our semantics for PE-CFGs has a non-deterministic choice operation. We 
showed how a deterministic choice operation based on the notions of failure 
and ordering turns PE-CFGs into quasi-PEGs; the addition of a not syntactic 
predicate then gave us a semantics that is equivalent to Ford's original semantics 
for PEGs. We then used our new formulations of CFGs and PEGs to study 
correspondences between four classes of CFGs and PEGs: LL(1), strong-LL($k$),
right-linear and LL-regular. We proved that LL(1) grammars already define the 
same language either interpreted as CFGs or as PEGs, as was already suspected, 
and gave transformations that yield equivalent PEGs for the grammars in the 
other three classes. 

All our transformations preserve the structure of the original grammars; we 
do not change or remove non-terminals, nor change the alternatives of a non- 
terminal, just add predicates to the beginning or the end of each alternative. 
This means that our transformations have a practical application in reusing 
grammars made for one formalism with tools made for the other, as conserving 
the structure of the grammar makes it easier to carry semantic actions from one 
tool to another without modification. 

Our transformations assume the presence of some kind of end-of-input marker,
and incorporate this marker in the resulting PEG. This does not affect our
equivalence results, but it does have implications for the composability of PEGs
resulting from our transformations. To use a PEG obtained from one of our
transformations as part of a larger PEG (for embedding one language in another,
for example) requires a suitable "end-of-input marker" to be picked (the
boundaries between languages have to be explicit). We believe this problem should
be easily solvable in practice.

There have been attempts to change PEG implementations so they work
with some left-recursive PEGs [17], [18], but they both suffer from not specifying
what is the intended semantics of left-recursive PEGs; left-recursion elimination
in CFGs intended for top-down parsers benefits from the grammars having a
clear semantics as CFGs, so a proof that left-recursion elimination allows a
top-down parser to recognize the same language as the original CFG is possible.
This is not the case with any scheme for allowing left-recursive PEGs, as by
design they change the language described by the PEG. One solution to this
problem is to start with a left-recursive CFG, then eliminate left recursion using
one of the sound techniques for doing so in CFGs, and then get an equivalent
PEG using one of the lemmas in this paper.

References 

[1] B. Ford, Parsing Expression Grammars: a recognition-based syntactic
foundation, in: Proceedings of the 31st ACM SIGPLAN-SIGACT Symposium
on Principles of Programming Languages, POPL '04, ACM, New York,
NY, USA, 2004, pp. 111-122.

[2] A. Birman, J. D. Ullman, Parsing algorithms with backtrack, Information
and Control 23 (1) (1973) 1-34.

[3] A. V. Aho, J. D. Ullman, The Theory of Parsing, Translation, and
Compiling, Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1972.

[4] G. Kahn, Natural semantics, in: Proceedings of the 4th Annual Symposium
on Theoretical Aspects of Computer Science, STACS '87, Springer-Verlag,
London, UK, 1987, pp. 22-39.

[5] G. Winskel, The Formal Semantics of Programming Languages: An
Introduction, Foundations of Computing, MIT Press, 1993.

[6] C. F. Clark, Message to comp.compilers reference 05-08-115,
http://compilers.iecc.com/comparch/article/05-09-009 (2005).

[7] D. Grune, C. J. Jacobs, Parsing Techniques - A Practical Guide, Ellis
Horwood, 1991.

[8] J. E. Hopcroft, J. D. Ullman, Introduction to Automata Theory,
Languages, and Computation, 1st Edition, Addison-Wesley Longman
Publishing Co., Inc., Boston, MA, USA, 1979.

[9] A. Nijholt, LL-regular grammars, International Journal of Computer
Mathematics 8 (1980) 303-318.






[10] S. Jarzabek, T. Krawczyk, LL-regular grammars, Information Processing
Letters 4 (2) (1975) 31-37.

[11] A. Nijholt, From LL-regular to LL(1) grammars: Transformations, covers
and parsing, RAIRO Informatique Theorique et Applications 16 (1982)
387-406.

[12] R. R. Redziejowski, Applying classical concepts to parsing expression
grammar, Fundamenta Informaticae 93 (2009) 325-336.

[13] R. R. Redziejowski, Some aspects of parsing expression grammar,
Fundamenta Informaticae 85 (2008) 441-451.

[14] S. Schmitz, Modular syntax demands verification, Tech. Rep.
I3S/RR-2006-32-FR, Laboratoire I3S, Universite de Nice-Sophia Antipolis,
France (Oct. 2006).

[15] T. J. Parr, R. W. Quong, Adding semantic and syntactic predicates to
LL(k): pred-LL(k), in: Proceedings of the 5th International Conference on
Compiler Construction, CC '94, Springer-Verlag, London, UK, 1994, pp.
263-277.

[16] T. Parr, K. Fisher, LL(*): the foundation of the ANTLR parser generator,
in: Proceedings of the 32nd ACM SIGPLAN conference on Programming
language design and implementation, PLDI '11, ACM, New York, NY,
USA, 2011, pp. 425-436.

[17] A. Warth, J. R. Douglass, T. Millstein, Packrat parsers can support left
recursion, in: Proceedings of the 2008 ACM SIGPLAN Symposium on Partial
Evaluation and Semantics-based Program Manipulation, PEPM '08, ACM,
New York, NY, USA, 2008, pp. 103-110.

[18] L. Tratt, Direct left-recursive parsing expression grammars, Tech. Rep.
EIS-10-01, School of Engineering and Information Sciences, Middlesex
University (Oct. 2010).






